Preface


System identification is concerned with estimating models of dynamical systems 
based on observed input and output signals. The term was coined in 1953 by Lotfi 
Zadeh, but various approaches had of course been suggested before that. One can 
distinguish two major routes in the development of system identification: (1) A statis- 
tical route relying on parameter estimation techniques such as Maximum Likelihood 
and (2) a realization route, based on techniques to realize (linear) dynamical systems 
from input/output descriptions, such as impulse responses. The literature on this in 
the past 70 years is extensive and impressive. 

Mathematically, system identification is an inverse problem and may suffer from 
numerical instability. The Russian researcher Tikhonov suggested in the 1940s a 
general way to curb the number of solutions for inverse problems which he called 
regularization. A simple regularization method applied to linear regression became 
known as ridge regression. Regularized system identification was for a long time 
used as a term for ridge regression. 

Around 2000 other ideas were put forward for achieving regularization. They 
had links to general function estimation with mathematical foundations in Repro- 
ducing Kernel Hilbert Spaces (RKHS) and kernel techniques. This resulted in 
intense research and extensive publications in the past 25 years. Regularized system 
identification has become also known as the kernel approach to identification. 

It is the purpose of this book to give a comprehensive overview of this develop- 
ment. A flow diagram of the book’s chapters is given in Fig. 1. It starts with the core 
of the regularization idea: To accept some bias in the estimates to achieve a smaller 
variance error and a better overall Mean Square Error (MSE) of the model. This is 
illustrated with the Stein effect discussed in Chap. 1. 

Traditional System identification (the statistical route) is surveyed in Chap. 2. 
An archetypical model structure is the linear regression and Chap. 3 explains how 
regularization is handled in such models, while the Bayesian interpretation of this is 
given in Chap. 4. The linear regression perspective is lifted to general linear models 
of dynamical systems in Chap. 5. 

Fig. 1 Chapter dependencies. The first two chapters are introductory. They review the
bias-variance trade-off, discussing the James-Stein estimator, and the classical approach to
linear system identification. Regularized kernel-based approaches to linear system
identification in finite-dimensional spaces are developed in Chaps. 3-5. The reader can then
skip directly to Chap. 9, where such techniques are illustrated via numerical experiments and
real-world cases. A different flow to reach the final chapter moves along Chaps. 6-8, where
regularization in reproducing kernel Hilbert spaces is described. These parts of the book
address estimation of infinite-dimensional (discrete- or continuous-time) linear models and
nonlinear system identification.


With this, the basic techniques of practical regularization for linear models have 
been outlined and the readers may continue directly to Chap. 9 for numerical 
experiments and practical applications. 

Chapters 6 and 7 lift the mathematical foundation of regularization with a treat- 
ment of how the techniques fit into the framework of RKHS, while Chap. 8 deals 
with applications to nonlinear models. 

Sections marked with the symbol ⋆ contain quite technical material which can be
skipped without interrupting the reading. Proofs of some of the theorems contained
in the book are gathered in the appendix at the end of each chapter.
Chapter 1
Bias


Abstract Adopting a quadratic loss, the performance of an estimator can be measured
in terms of its mean squared error, which decomposes into a variance and a bias
component. This introductory chapter contains two linear regression examples which
illustrate the importance of designing estimators that balance these two components
well. The first example deals with the estimation of the means of independent
Gaussians. We review the classical least squares approach which, at first sight, could
appear to be the most appropriate solution to the problem. Remarkably, we will see
instead that this unbiased approach can be dominated by a particular biased estimator,
the so-called James-Stein estimator. Within this book, this represents the first example
of regularized least squares, an estimator which will play a key role in subsequent
chapters. The second example deals with a classical system identification problem:
impulse response estimation. A simple numerical experiment shows how the variance
of least squares can be too large, leading to unacceptable system reconstructions. The
use of an approach known as ridge regression then gives first simple intuitions on the
usefulness of regularization in the system identification scenario.


1.1 The Stein Effect 


Consider the following “basic” statistical problem. Starting from the realizations of
$N$ independent Gaussian random variables $y_i \sim \mathcal{N}(\theta_i, \sigma^2)$, our aim is to
reconstruct the means $\theta_i$, contained in the vector $\theta$ seen as a deterministic but
unknown parameter vector.¹ The estimation performance will be measured in terms of
mean squared error (MSE). In particular, let $\mathcal{E}$ and $\|\cdot\|$ denote expectation and
Euclidean norm, respectively. Then, given an estimator $\hat{\theta}$ of an $N$-dimensional
vector $\theta$ with $i$th component $\hat{\theta}_i$, one has

$$
\mathrm{MSE}_{\hat{\theta}} = \mathcal{E}\|\hat{\theta} - \theta\|^2
= \underbrace{\sum_{i=1}^{N} \mathcal{E}\big(\hat{\theta}_i - \mathcal{E}\hat{\theta}_i\big)^2}_{\text{Variance}}
+ \underbrace{\sum_{i=1}^{N} \big(\theta_i - \mathcal{E}\hat{\theta}_i\big)^2}_{\text{Bias}^2}, \tag{1.1}
$$

where in the last passage we have decomposed the error into two components. The
first one is the variance of the estimator, while the difference between the mean and the
true parameter values measures the bias. If the mean coincides with $\theta$, the estimator
is said to be unbiased. The total error thus has two contributions: the variance and
the (squared) bias.

¹ In future chapters, $\theta_0$ will be used to denote the true value of the deterministic vector
that has generated the data, distinguishing it from the vector which parametrizes the model. In
this introductory chapter, $\theta$ is instead used in both cases to keep the notation as simple
as possible.
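The decomposition (1.1) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the shrinkage estimator $cY$ with a fixed scalar $c$ (a choice made here for convenience, not one taken from the text), for which the variance is $c^2 N \sigma^2$ and the squared bias is $(1-c)^2\|\theta\|^2$. The function names and the test values of $\theta$, $\sigma$ and $c$ are hypothetical.

```python
import random

def shrinkage_mse_terms(theta, sigma, c):
    """Closed-form variance and squared bias of the estimator c * Y."""
    n = len(theta)
    variance = c**2 * n * sigma**2
    bias_sq = (1.0 - c) ** 2 * sum(t * t for t in theta)
    return variance, bias_sq

def monte_carlo_mse(theta, sigma, c, runs=20000, seed=0):
    """Empirical MSE of c * Y, averaging the squared error over many noisy draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        total += sum((c * rng.gauss(t, sigma) - t) ** 2 for t in theta)
    return total / runs

theta = [1.0, -2.0, 0.5]          # hypothetical true means
variance, bias_sq = shrinkage_mse_terms(theta, sigma=1.0, c=0.8)
empirical = monte_carlo_mse(theta, sigma=1.0, c=0.8)
print(variance, bias_sq, variance + bias_sq, empirical)
```

With $c = 1$ the estimator is least squares (zero bias, variance $N\sigma^2$), while $c = 0$ gives the constant estimator at the origin (zero variance, bias $\|\theta\|^2$).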

Note that the mean estimation problem introduced above is a simple instance
of linear Gaussian regression. In fact, letting $I_N$ be the $N \times N$ identity matrix, the
measurements model is

$$
Y = \theta + E, \qquad E \sim \mathcal{N}(0, \sigma^2 I_N), \tag{1.2}
$$

where $Y$ is the $N$-dimensional (column) vector with $i$th component $y_i$. The most
popular strategy to recover $\theta$ from data is least squares, which also corresponds to
maximum likelihood in this Gaussian scenario. The solution minimizes

$$
\|Y - \theta\|^2
$$

and is then given by

$$
\hat{\theta}^{LS} = Y.
$$


Apparently, the obtained estimator is the most reasonable one. A first intuitive
argument supporting it is the fact that the random variables $\{y_j\}_{j \neq i}$ seem unable to
carry any information on $\theta_i$, since all the noises $e_i$ are independent. Hence, the
natural estimate of $\theta_i$ indeed appears to be its noisy observation $y_i$. This estimator is
also unbiased: for any $\theta$ we have

$$
\mathcal{E}\big(\hat{\theta}^{LS}\big) = \theta.
$$

Hence, from (1.1) we see that the MSE coincides with its variance, which is constant
over $\theta$ and given by

$$
\mathrm{MSE}_{LS} = \mathcal{E}\|\hat{\theta}^{LS} - \theta\|^2 = N\sigma^2.
$$

According to Markov’s theorem, $\hat{\theta}^{LS}$ is also efficient. This means that its variance
is equal to the Cramér-Rao limit: no unbiased estimate can be better than the least
squares estimate, e.g., see [9, 17].
1.1.1 The James-Stein Estimator


By introducing some bias in the inference process, it is easy to obtain estimators which
strictly dominate least squares (in the MSE sense) over certain parameter regions.
The most trivial example is the constant estimator $\hat{\theta} = a$. Its variance is null, so that
its MSE reduces to the bias component $\|\theta - a\|^2$. Hence, even if the behaviour of
$\hat{\theta}$ is unacceptable in most of the parameter space, this estimator outperforms least
squares in the region

$$
\{\theta \ \text{s.t.} \ \|\theta - a\|^2 < N\sigma^2\}.
$$

Note a feature common to least squares and the constant estimator. Neither of them
attempts to trade bias and variance: they just set to zero one of the two MSE
components in (1.1). An alternative route is the design of estimators which try to
balance bias and variance. Rather surprisingly, we will now see that this strategy can
dominate $\hat{\theta}^{LS}$ over the entire parameter space.

The first criticisms of least squares were introduced by Stein in the 1950s [23]
and can be summarized as follows. A good mean estimator $\hat{\theta}$ should also lead to a
good estimate of the Euclidean norm of $\theta$. Thus, one should have

$$
\|\hat{\theta}\| \approx \|\theta\|.
$$

But, if we consider the “natural” estimator $\hat{\theta}^{LS} = Y$, in view of the independence of
the errors $e_i$, one obtains

$$
\mathcal{E}\|Y\|^2 = N\sigma^2 + \|\theta\|^2.
$$

This shows that the least squares estimator tends to overestimate $\|\theta\|$. It thus seems
desirable to correct $\hat{\theta}^{LS}$ by shrinking the estimate towards the origin, e.g., adopting
estimators of the form $\hat{\theta}^{LS}(1 - r)$, where $r$ is a positive scalar. The most famous
example is the James-Stein estimator [15], where $r$ is determined from data as follows:

$$
r = \frac{(N-2)\sigma^2}{\|Y\|^2},
$$

hence leading to

$$
\hat{\theta}^{JS} = Y - \frac{(N-2)\sigma^2}{\|Y\|^2}\, Y.
$$

Note that, even if all the components of $Y$ are mutually independent, $\hat{\theta}^{JS}$ exploits
all of them to estimate each $\theta_i$. The surprising outcome is that $\hat{\theta}^{JS}$ outperforms
$\hat{\theta}^{LS}$ over all the parameter space, as illustrated in the next theorem.


Theorem 1.1 (James-Stein’s MSE, based on [15]) Consider $N$ Gaussian and
independent random variables $y_i \sim \mathcal{N}(\theta_i, \sigma^2)$. Let also $\hat{\theta}^{JS}$ denote the
James-Stein estimator of the means, i.e.,

$$
\hat{\theta}^{JS} = Y - \frac{(N-2)\sigma^2}{\|Y\|^2}\, Y.
$$

Then, if $N \geq 3$, the MSE of $\hat{\theta}^{JS}$ satisfies

$$
\mathrm{MSE}_{JS} < N\sigma^2 \quad \forall \theta.
$$

Fig. 1.1 Estimation of the mean $\theta \in \mathbb{R}^{10}$ of a Gaussian with covariance equal to the
identity matrix. The plot displays the mean squared error of least squares ($\mathrm{MSE}_{LS}$) and of the
James-Stein estimator ($\mathrm{MSE}_{JS}$), including its bias-variance decomposition, as a function of
$\theta_1$ with $\theta = [\theta_1 \ 0 \ \ldots \ 0]$


We say that an estimator dominates another estimator if for all $\theta$ its MSE is
not larger and for some $\theta$ it is smaller. In statistics, an estimator is then said to be
admissible if no other estimator exists that dominates it in terms of MSE. The above
theorem then shows that the least squares estimator of the mean of a multivariate
Gaussian is not admissible if the dimension exceeds two. The reason is that, even
when the Gaussians are independent, the global MSE can be reduced uniformly by
adding some bias to the estimate. This is also graphically illustrated in Fig. 1.1, where
$\mathrm{MSE}_{JS}$, along with its decomposition, is plotted as a function of the component $\theta_1$
of the ten-dimensional vector $\theta = [\theta_1 \ 0 \ \ldots \ 0]$ (noise variance is equal to one). One
can see that $\mathrm{MSE}_{JS} < \mathrm{MSE}_{LS}$ since the bias introduced by $\hat{\theta}^{JS}$ is compensated by
a greater reduction in the variance of the estimate. Note however that James-Stein
improves the overall MSE and not the individual errors affecting the $\theta_i$. This aspect
can be important in certain applications where it is not desirable to trade a higher
individual MSE for a smaller overall MSE.
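The domination result can be explored with a small Monte Carlo experiment. The following is a minimal sketch, not the code behind Fig. 1.1: it mimics that setting ($N = 10$, unit noise variance, $\theta = [\theta_1 \ 0 \ \ldots \ 0]$), while the function names, the number of runs and the three values of $\theta_1$ are assumptions made here for illustration.

```python
import random

def james_stein(y, sigma2):
    """Shrink the observation vector y towards the origin (James-Stein rule)."""
    n = len(y)
    norm_sq = sum(v * v for v in y)
    factor = 1.0 - (n - 2) * sigma2 / norm_sq
    return [factor * v for v in y]

def compare_mse(theta, sigma=1.0, runs=20000, seed=0):
    """Monte Carlo estimates of the MSE of least squares and of James-Stein."""
    rng = random.Random(seed)
    mse_ls = mse_js = 0.0
    for _ in range(runs):
        y = [rng.gauss(t, sigma) for t in theta]      # least squares estimate is y itself
        est = james_stein(y, sigma**2)
        mse_ls += sum((a - t) ** 2 for a, t in zip(y, theta))
        mse_js += sum((a - t) ** 2 for a, t in zip(est, theta))
    return mse_ls / runs, mse_js / runs

results = {}
for theta_1 in (0.0, 2.0, 6.0):
    theta = [theta_1] + [0.0] * 9
    results[theta_1] = compare_mse(theta)
    print(theta_1, results[theta_1])
```

For $\theta = 0$ the final MSE expression in the appendix gives $N - (N-2)^2\,\mathcal{E}(1/\|Y\|^2) = 10 - 64/8 = 2$, since $\|Y\|^2$ is then chi-square with $N$ degrees of freedom; the simulated value should be close to this, and the gain over least squares should shrink as $\theta_1$ grows.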
It is easy to check that the James-Stein estimator admits the following interesting
reformulation:

$$
\hat{\theta}^{JS} = \arg\min_{\theta} \ \|Y - \theta\|^2 + \gamma \|\theta\|^2 = \frac{Y}{1+\gamma}, \tag{1.3}
$$

where the positive scalar $\gamma$ is determined from data as follows:

$$
\gamma = \frac{(N-2)\sigma^2}{\|Y\|^2 - (N-2)\sigma^2}. \tag{1.4}
$$

Equation (1.3) thus reveals that $\hat{\theta}^{JS}$ is a particular version of regularized least squares,
an estimator which will play a central role in this book. In particular, the objective
in (1.3) contains two contrasting terms. The first one, $\|Y - \theta\|^2$, is a quadratic loss
which measures the adherence to experimental data. The second one, $\|\theta\|^2$, is a
regularizer which shrinks the estimate towards the origin by penalizing the energy of
the solution. The role of the regularization parameter $\gamma$ is then to balance these two
components via a simple scalar adjustment. Equation (1.4) shows that James-Stein’s
strategy is to set its value to the inverse of an estimate of the signal-to-noise ratio.
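The equivalence between the shrinkage form of the James-Stein estimator and the regularized problem (1.3) can be verified in two short steps. The derivation below is a sketch that takes $\gamma$ in (1.4) to be $(N-2)\sigma^2 / (\|Y\|^2 - (N-2)\sigma^2)$; note that this quantity is positive only when $\|Y\|^2 > (N-2)\sigma^2$.

```latex
\begin{align*}
\nabla_\theta \left( \|Y-\theta\|^2 + \gamma \|\theta\|^2 \right)
  &= -2(Y-\theta) + 2\gamma\theta = 0
  \;\Longrightarrow\; \hat{\theta} = \frac{Y}{1+\gamma}, \\
1+\gamma &= \frac{\|Y\|^2}{\|Y\|^2 - (N-2)\sigma^2}
  \;\Longrightarrow\; \frac{1}{1+\gamma} = 1 - \frac{(N-2)\sigma^2}{\|Y\|^2}.
\end{align*}
```

The second line recovers the shrinkage factor $1 - r$ with $r = (N-2)\sigma^2/\|Y\|^2$.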


1.1.2 Extensions of the James-Stein Estimator ⋆


We have seen that the James-Stein estimator corrects each component of $\hat{\theta}^{LS}$ by
shifting it towards the origin. This implies that the MSE improvement will be larger when
the components of $\theta$ are close to zero. Actually, there is nothing special about the origin.
If the true $\theta$ is expected to be close to $a \in \mathbb{R}^N$, one can modify the original $\hat{\theta}^{JS}$ as
follows:

$$
\hat{\theta}^{JS} = Y - \frac{(N-2)\sigma^2}{\|Y - a\|^2}\,(Y - a).
$$

The result is an estimator which still dominates least squares, with the origin’s role
now played by $a$. The estimator thus concentrates the MSE improvement around $a$.

Now, let us consider a non-orthonormal scenario where Gaussian linear regression
amounts to estimating $\theta$ from the $N$ measurements

$$
y_i = d_i \theta_i + e_i, \qquad e_i \sim \mathcal{N}(0, 1),
$$

with all the noises $e_i$ mutually independent. The least squares (maximum likelihood)
estimator is now

$$
\hat{\theta}_i^{LS} = \frac{y_i}{d_i}, \qquad i = 1, \ldots, N,
$$

and its MSE is the sum of the variances of the $\hat{\theta}_i^{LS}$, i.e.,

$$
\mathrm{MSE}_{LS} = \sum_{i=1}^{N} \frac{1}{d_i^2}.
$$

Note that the MSE can be large when just one of the $d_i$ is small. In this case, the
problem is said to be ill-conditioned: even a moderate measurement error can lead
to a large reconstruction error.

Also in this non-orthonormal scenario, it is possible to design estimators whose
MSE is uniformly smaller than $\mathrm{MSE}_{LS}$. The number of possible choices is huge,
depending on the region of the parameter space where one wants to concentrate the
improvement. There is however an important limitation shared by all Stein-type
estimators: in general, they are not very effective against ill-conditioning. This is
illustrated in the following example, which presents an estimator whose negative
features are representative of the drawbacks of Stein’s estimation in non-orthogonal
settings.


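Example 1.2 (A generalization of James-Stein) Consider the estimator $\hat{\theta}$ whose $i$th
component is given by

$$
\hat{\theta}_i = \left[ 1 - \frac{N-2}{S}\, d_i^2 \right] \frac{y_i}{d_i}, \qquad i = 1, \ldots, N, \tag{1.5}
$$

where

$$
S = \sum_{j=1}^{N} d_j^2 y_j^2.
$$

It is now shown that $\hat{\theta}$ is a generalization of James-Stein able to outperform least
squares over the entire parameter space. In fact, defining

$$
h_i(Y) = \frac{N-2}{S}\, d_i^2 y_i,
$$

so that $\hat{\theta}_i = (y_i - h_i(Y))/d_i$, after simple computations we obtain

$$
\begin{aligned}
\mathrm{MSE}_{\hat{\theta}}
&= \sum_{i=1}^{N} \frac{1}{d_i^2}
 + \mathcal{E}\left[ -2 \sum_{i=1}^{N} \frac{(y_i - d_i\theta_i)\, h_i(Y)}{d_i^2}
 + \sum_{i=1}^{N} \frac{h_i^2(Y)}{d_i^2} \right] \\
&= \sum_{i=1}^{N} \frac{1}{d_i^2}
 + \mathcal{E}\left[ -2 \sum_{i=1}^{N} \frac{1}{d_i^2} \frac{\partial h_i(Y)}{\partial y_i}
 + \sum_{i=1}^{N} \frac{h_i^2(Y)}{d_i^2} \right],
\end{aligned}
$$

where the last equality comes from Lemma 1.1 reported in Sect. 1.4. Since

$$
\frac{\partial h_i(Y)}{\partial y_i} = \frac{N-2}{S}\, d_i^2 - 2\,\frac{N-2}{S^2}\, d_i^4 y_i^2,
$$

one has

$$
\begin{aligned}
\mathcal{E}\left[ -2 \sum_{i=1}^{N} \frac{1}{d_i^2} \frac{\partial h_i(Y)}{\partial y_i}
 + \sum_{i=1}^{N} \frac{h_i^2(Y)}{d_i^2} \right]
&= \mathcal{E}\left[ -\frac{2N(N-2)}{S} + \frac{4(N-2)}{S^2} \sum_{i=1}^{N} d_i^2 y_i^2
 + \frac{(N-2)^2}{S^2} \sum_{i=1}^{N} d_i^2 y_i^2 \right] \\
&= -\mathcal{E}\left[ \frac{(N-2)^2}{S} \right] < 0,
\end{aligned}
$$

which implies

$$
\mathrm{MSE}_{\hat{\theta}} < \mathrm{MSE}_{LS} \quad \forall \theta.
$$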


However, assume that the problem is ill-conditioned. Then, if one $d_i$ is small and the
values of the $d_j$ are quite spread, we could well have $d_i^2/S \approx 0$. Hence, (1.5) essentially
reduces to

$$
\hat{\theta}_i = \left[ 1 - \frac{N-2}{S}\, d_i^2 \right] \frac{y_i}{d_i} \approx \frac{y_i}{d_i},
$$

which is the least squares estimate of $\theta_i$. This means that the signal components
mostly influenced by the noise, i.e., associated with small $d_i$, are not regularized.
Thus, in the presence of ill-conditioning, $\hat{\theta}$ will likely return an estimate affected by
large errors.
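The lack of regularization on the poorly excited components can be seen on a toy instance. The sketch below assumes the form of the estimator in (1.5) with $S = \sum_j d_j^2 y_j^2$ (an assumption consistent with the derivation in the example); the values of $d$ and $y$ are hypothetical and chosen only to make one $d_i$ much smaller than the others.

```python
def generalized_stein(y, d):
    """Estimator of Example 1.2, assuming S = sum_j (d_j * y_j)^2."""
    n = len(y)
    s = sum((dj * yj) ** 2 for dj, yj in zip(d, y))
    # Shrinkage factor applied to each least squares component y_i / d_i.
    factors = [1.0 - (n - 2) * di**2 / s for di in d]
    estimate = [f * yi / di for f, yi, di in zip(factors, y, d)]
    return estimate, factors

d = [1.0, 1.0, 1.0, 1.0, 0.01]     # last component is poorly excited
y = [1.2, -0.7, 0.9, 1.5, 0.3]     # hypothetical measurements
estimate, factors = generalized_stein(y, d)
least_squares = [yi / di for yi, di in zip(y, d)]

for f, ls, est in zip(factors, least_squares, estimate):
    print(round(f, 5), round(ls, 3), round(est, 3))
```

The well-conditioned components are shrunk substantially, while the factor of the last component is almost one, so its noisy least squares value (amplified by $1/d_i = 100$) passes through nearly unchanged.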


1.2 Ridge Regression 


Consider now one of the fundamental problems in system identification. The task
is to estimate the impulse response $g^0$ of a discrete-time, linear and causal dynamic
system, starting from noisy output data. The measurements model is

$$
y(t) = \sum_{k=1}^{\infty} g_k^0\, u(t-k) + e(t), \qquad t = 1, \ldots, N, \tag{1.6}
$$

where $t$ denotes time, the sampling interval is one time unit for simplicity, the $g_k^0$
indicate the impulse response coefficients, $u(t)$ is the known system input while $e(t)$
is the noise.

To determine the impulse response from input-output measurements, one of the
main questions is how to parametrize the unknown $g^0$. The classical approach, which
will also be reviewed in the next chapter, introduces a collection of impulse response
models $g(\theta)$, each parametrized by a different vector $\theta$. In particular, here we will
adopt an FIR model of order $m$, i.e., $g_k(\theta) = \theta_k$ for $k = 1, \ldots, m$ and zero elsewhere.
This permits reformulating (1.6) as a linear regression: we stack all the elements
$y(t)$ and $e(t)$ to form the vectors $Y$ and $E$ and obtain the model

$$
Y = \Phi\theta + E
$$

with the regression matrix $\Phi \in \mathbb{R}^{N \times m}$ given by

$$
\Phi = \begin{pmatrix}
u(0) & u(-1) & u(-2) & \ldots & u(-m+1) \\
u(1) & u(0) & u(-1) & \ldots & u(-m+2) \\
\vdots & \vdots & \vdots & & \vdots \\
u(N-1) & u(N-2) & u(N-3) & \ldots & u(N-m)
\end{pmatrix}.
$$

We can now use least squares to estimate $\theta$. Assuming $\Phi^T\Phi$ of full rank, we obtain

$$
\begin{aligned}
\hat{\theta}^{LS} &= \arg\min_{\theta} \ \|Y - \Phi\theta\|^2 && \text{(1.7a)} \\
&= (\Phi^T\Phi)^{-1}\Phi^T Y. && \text{(1.7b)}
\end{aligned}
$$

Note that the impulse response estimate is a function of the FIR order, which
corresponds to the dimension $m$ of $\theta$. The choice of $m$ is a trade-off between bias (a large
$m$ is needed to describe slowly decaying impulse responses without too much error)
and variance (large $m$ requires estimation of many parameters, leading to large
variance). This can be illustrated with a numerical experiment. The unknown impulse
response $g^0$ is defined by the following rational transfer function:

$$
\frac{(z+1)^2}{z(z-0.8)(z-0.7)},
$$

which, in practice, is equal to zero after fewer than 50 samples ($g^0$ is the red line in
Fig. 1.3). We estimate the system from 1000 outputs corrupted by white Gaussian
noise $e(t)$ of variance equal to the variance of the noiseless output divided by 50, see
Fig. 1.2 (bottom panel). Data come from the system initially at rest and then fed at
$t = 0$ with white noise low-pass filtered by $z/(z - 0.99)$, see Fig. 1.2 (top panel). The
reconstruction error is very large if we try to estimate $g^0$ with $m = 50$: linear models
are easy to estimate, but the drawback is that high-order FIR models may suffer from high
variance. Hence, it is important to select a model order which balances bias and
variance well. To do that, one needs to try different values of $m$ and then use some validation
procedure to determine the “optimal” one. In this case, since the true $g^0$ is known,
we can obtain the best value by selecting that $m \in [1, \ldots, 50]$ which minimizes the
MSE. This is an example of an oracle-based procedure not implementable in practice:
the optimal order is selected exploiting the knowledge of the true system. We obtain
$m = 18$, which corresponds to $\mathrm{MSE}_{LS} = 70.7$ and leads to the impulse response
estimate displayed in Fig. 1.3. Even if the data set size is large and the signal-to-noise
ratio is good, the estimate is far from satisfactory. The reason is that the low-pass
input has poor excitation and leads to an ill-conditioned problem. This means that
the condition number of the regression matrix $\Phi$ is large, so that even a small output
error can produce a large reconstruction error.
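The effect of the input on conditioning can be illustrated in a few lines. The sketch below is not the code used for the experiment in the text: it only builds the FIR regression matrix for a system at rest (inputs before time 0 set to zero), and compares the condition number obtained with a white input against the one obtained after low-pass filtering by $z/(z - 0.99)$. The function names, the random seed and the use of NumPy are assumptions.

```python
import numpy as np

def fir_regression_matrix(u, m):
    """Row t (t = 1..N) holds u(t-1), ..., u(t-m); inputs before time 0 are zero."""
    u = np.asarray(u, dtype=float)
    n = len(u)
    phi = np.zeros((n, m))
    for k in range(1, m + 1):
        # Column k-1 contains the input delayed by k samples.
        phi[k - 1:, k - 1] = u[: n - k + 1]
    return phi

def least_squares(phi, y):
    """Least squares estimate (1.7); lstsq avoids forming the inverse explicitly."""
    theta, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return theta

rng = np.random.default_rng(0)
n, m = 1000, 50
white = rng.standard_normal(n)

# Low-pass input: u(t) = 0.99 u(t-1) + w(t), i.e., white noise filtered by z/(z - 0.99).
lowpass = np.zeros(n)
for t in range(n):
    lowpass[t] = (0.99 * lowpass[t - 1] if t > 0 else 0.0) + white[t]

cond_white = np.linalg.cond(fir_regression_matrix(white, m))
cond_lowpass = np.linalg.cond(fir_regression_matrix(lowpass, m))
print(cond_white, cond_lowpass)
```

The low-pass input should yield a much larger condition number than the white one, which is the mechanism behind the poor least squares reconstruction discussed above.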

Fig. 1.4 MSE of ridge regression and its bias-variance decomposition as a function of the
regularization parameter


An alternative to the classical paradigm, where different model structures are
introduced, is the following straightforward generalization of (1.3), known as ridge
regression [13, 14]:

$$
\begin{aligned}
\hat{\theta}^{R} &= \arg\min_{\theta} \ \|Y - \Phi\theta\|^2 + \gamma \|\theta\|^2 && \text{(1.8a)} \\
&= (\Phi^T\Phi + \gamma I_m)^{-1}\Phi^T Y, && \text{(1.8b)}
\end{aligned}
$$

where we set $m = 50$ to solve our problem. Letting $A = (\Phi^T\Phi + \gamma I_m)^{-1}\Phi^T$, it is
easy to derive the MSE decomposition associated with $\hat{\theta}^{R}$:

$$
\mathrm{MSE}_{R} = \underbrace{\sigma^2\, \mathrm{Trace}(AA^T)}_{\text{Variance}}
+ \underbrace{\|\theta - A\Phi\theta\|^2}_{\text{Bias}^2}. \tag{1.9}
$$

Figure 1.4 displays $\mathrm{MSE}_{R}$ for the particular system identification problem at hand
as a function of the regularization parameter. Note that $\gamma$ plays the role of the model
order in the classical scenario but can be tuned in a continuous manner to reach a
good bias-variance trade-off. It is also interesting to see its influence on the variance
and bias components. The variance is a decreasing function of the regularization
parameter. Hence, its maximum is reached for $\gamma = 0$, where $\hat{\theta}^{R}$ reduces to the least
squares estimator $\hat{\theta}^{LS}$ given by (1.7) with $m = 50$. Instead, the bias increases with
$\gamma$. At the limit, for $\gamma \to \infty$, the penalty $\|\theta\|^2$ is so overweighted that $\hat{\theta}^{R}$ becomes
the constant estimator centred on the origin (it returns all null impulse response
coefficients).

In Fig. 1.5, we finally display the ridge regularized estimate with $\gamma$ set to the value
minimizing the error and leading to $\mathrm{MSE}_{R} = 16.8$. It is evident that ridge regression
provides a much better bias-variance trade-off than selecting the FIR order.
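The closed form (1.8b) and the decomposition (1.9) translate directly into code. The following is a minimal sketch on a small synthetic regression problem, not a reproduction of Figs. 1.4 and 1.5: the matrix `phi`, the vector `theta`, the noise variance and the grid of $\gamma$ values are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def ridge_estimate(phi, y, gamma):
    """Closed-form ridge solution (Phi^T Phi + gamma I)^{-1} Phi^T Y, as in (1.8b)."""
    m = phi.shape[1]
    return np.linalg.solve(phi.T @ phi + gamma * np.eye(m), phi.T @ y)

def ridge_mse_terms(phi, theta, sigma2, gamma):
    """Variance and squared bias of ridge regression for a known theta, as in (1.9)."""
    m = phi.shape[1]
    a = np.linalg.solve(phi.T @ phi + gamma * np.eye(m), phi.T)
    variance = float(sigma2 * np.trace(a @ a.T))
    bias_sq = float(np.sum((theta - a @ phi @ theta) ** 2))
    return variance, bias_sq

rng = np.random.default_rng(0)
phi = rng.standard_normal((30, 5))               # hypothetical regression matrix
theta = np.array([1.0, 0.8, 0.5, 0.2, 0.1])      # hypothetical true parameters
sigma2 = 0.5
y = phi @ theta + np.sqrt(sigma2) * rng.standard_normal(30)

gammas = [0.0, 0.1, 1.0, 10.0, 100.0]
variances, biases = [], []
for gamma in gammas:
    variance, bias_sq = ridge_mse_terms(phi, theta, sigma2, gamma)
    variances.append(variance)
    biases.append(bias_sq)
    print(gamma, variance, bias_sq, variance + bias_sq)
```

Since the decomposition requires the true $\theta$, choosing $\gamma$ by minimizing the printed sum is an oracle procedure, in the same sense as the oracle choice of the FIR order discussed earlier.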
1.3 Further Topics and Advanced Reading


Stein’s intuition on the development of an estimator able to dominate least squares in
terms of global MSE can be found in [23], while the specific form of $\hat{\theta}^{JS}$ was
obtained in [15]. Since then, a large variety of estimators outperforming
least squares, also under different losses, have been designed. It has been proved that
there exist estimators which dominate James-Stein, even if the MSE improvement
is not large, as described in [12, 16, 25]. Extensions and applications can be found
in [5, 6, 11, 22, 24, 26]. A James-Stein version of the Kalman filter is derived in
[18]. For interesting discussions on the limitations of Stein-type estimators in facing
ill-conditioning see [8], but also [19] for new outcomes with better numerical stability
properties. Other developments are reported in [7], where generalizations of Stein’s
lemma are also described.

The paper [10] describes connections between James-Stein estimation and the
so-called empirical Bayes approaches, which will be treated later on in this book. The
interplay between Stein-type estimators and the Bayes approach is also discussed in
[2]. There, one can also find an estimator which dominates least squares by concentrating
the MSE improvement in an ellipsoid that can be chosen by the user in the parameter
space. This approach is deeply connected with robust Bayesian estimation concepts,
e.g., see [1, 3].

The term ridge regression has been popularized by the works [13, 14]. This
approach, introduced to guard against ill-conditioning and numerical instability, is
an example of Tikhonov regularization for ill-posed problems. Among the first
classical works on regularization and inverse problems, it is worth citing [4, 20,
27-29]. A recent survey on the use of regularization for system identification can
instead be found in [21]. The literature on this topic is huge and other relevant works
will be cited in the next chapters.
1.4 Appendix: Proof of Theorem 1.1


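To discuss the properties of the James-Stein estimator, it is first useful to introduce
a result which is a simplified version of Lemma 3.2 reported in Chap. 3, known as
Stein’s lemma.


Lemma 1.1 (Stein’s lemma, based on [24]) Consider $N$ Gaussian and independent
random variables $y_i \sim \mathcal{N}(\theta_i, \sigma^2)$. For $i = 1, \ldots, N$, let also $h_i : \mathbb{R}^N \to \mathbb{R}$ be
a differentiable function such that $\mathcal{E}\left| \frac{\partial h_i(Y)}{\partial y_i} \right| < \infty$. Then, it holds that

$$
\mathcal{E}\big[(y_i - \theta_i)\, h_i(Y)\big] = \sigma^2\, \mathcal{E}\left[ \frac{\partial h_i(Y)}{\partial y_i} \right].
$$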


i 


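Proof During the proof, we use $\mathcal{E}_{j \neq i}$ to denote the expectation conditional on
$\{y_j\}_{j \neq i}$. Also, abusing notation, $h_i(x)$ with $x \in \mathbb{R}$ indicates the function $h_i$ with $i$th
argument set to $x$ while the other arguments are set to $y_j$, $j \neq i$.

Note that, in view of the independence assumptions, each $y_i$ conditional on $\{y_j\}_{j \neq i}$
is still Gaussian with mean $\theta_i$ and variance $\sigma^2$. Then, using integration by parts, one
has

$$
\begin{aligned}
\mathcal{E}_{j \neq i}\left[ \frac{\partial h_i(Y)}{\partial y_i} \right]
&= \int_{-\infty}^{+\infty} \frac{\partial h_i(x)}{\partial x}\,
   \frac{\exp\!\big(-(x-\theta_i)^2/(2\sigma^2)\big)}{\sqrt{2\pi}\,\sigma}\, dx \\
&= \left[ h_i(x)\, \frac{\exp\!\big(-(x-\theta_i)^2/(2\sigma^2)\big)}{\sqrt{2\pi}\,\sigma} \right]_{-\infty}^{+\infty}
 + \int_{-\infty}^{+\infty} \frac{x-\theta_i}{\sigma^2}\, h_i(x)\,
   \frac{\exp\!\big(-(x-\theta_i)^2/(2\sigma^2)\big)}{\sqrt{2\pi}\,\sigma}\, dx \\
&= \int_{-\infty}^{+\infty} \frac{x-\theta_i}{\sigma^2}\, h_i(x)\,
   \frac{\exp\!\big(-(x-\theta_i)^2/(2\sigma^2)\big)}{\sqrt{2\pi}\,\sigma}\, dx \\
&= \frac{\mathcal{E}_{j \neq i}\big[(y_i - \theta_i)\, h_i(Y)\big]}{\sigma^2}.
\end{aligned}
$$

Note that the penultimate equality exploits the fact that $h_i(x) \exp\!\big(-(x-\theta_i)^2/(2\sigma^2)\big)$
must be infinitesimal as $x \to \pm\infty$, otherwise the assumption
$\mathcal{E}\left| \frac{\partial h_i(Y)}{\partial y_i} \right| < \infty$ would not hold. Using the above result, we obtain

$$
\begin{aligned}
\mathcal{E}\big[(y_i - \theta_i)\, h_i(Y)\big]
&= \mathcal{E}\Big[ \mathcal{E}_{j \neq i}\big[(y_i - \theta_i)\, h_i(Y)\big] \Big] \\
&= \sigma^2\, \mathcal{E}\left[ \mathcal{E}_{j \neq i}\left[ \frac{\partial h_i(Y)}{\partial y_i} \right] \right] \\
&= \sigma^2\, \mathcal{E}\left[ \frac{\partial h_i(Y)}{\partial y_i} \right],
\end{aligned}
$$

and this completes the proof.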
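As a quick sanity check of the lemma, consider the simplest admissible choice $h(Y) = y_i$ (an illustrative choice made here, not taken from the text). Both sides can be computed by hand:

```latex
\begin{align*}
\mathcal{E}\big[(y_i - \theta_i)\, y_i\big]
  &= \mathcal{E}\big[(y_i - \theta_i)^2\big] + \theta_i\, \mathcal{E}\big[y_i - \theta_i\big]
   = \sigma^2 + 0 = \sigma^2, \\
\sigma^2\, \mathcal{E}\left[\frac{\partial y_i}{\partial y_i}\right]
  &= \sigma^2 \cdot 1 = \sigma^2 .
\end{align*}
```

The two expressions agree, as the lemma requires.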
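We now show that the MSE of the James-Stein estimator is uniformly smaller
than the MSE of least squares. One has

$$
\begin{aligned}
\mathrm{MSE}_{JS} &= \mathcal{E}\left( \|\theta - \hat{\theta}^{JS}(Y)\|^2 \right) \\
&= \mathcal{E}\left( \left\| \theta - Y + \frac{(N-2)\sigma^2}{\|Y\|^2}\, Y \right\|^2 \right) \\
&= \mathcal{E}\left( \|\theta - Y\|^2 + \frac{(N-2)^2\sigma^4}{\|Y\|^4}\, \|Y\|^2
   + 2(\theta - Y)^T Y\, \frac{(N-2)\sigma^2}{\|Y\|^2} \right) \\
&= N\sigma^2 + \mathcal{E}\left( \frac{(N-2)^2\sigma^4}{\|Y\|^2}
   + 2(\theta - Y)^T Y\, \frac{(N-2)\sigma^2}{\|Y\|^2} \right).
\end{aligned}
$$

As for the last term inside the expectation, exploiting Stein’s lemma with

$$
h_i(Y) = \frac{y_i}{\|Y\|^2}, \qquad
\frac{\partial h_i(Y)}{\partial y_i} = \frac{1}{\|Y\|^2} - \frac{2 y_i^2}{\|Y\|^4},
$$

one has

$$
\begin{aligned}
\mathcal{E}\left( \frac{(\theta - Y)^T Y}{\|Y\|^2} \right)
&= \mathcal{E}\left( \sum_{i=1}^{N} \frac{(\theta_i - y_i)\, y_i}{\|Y\|^2} \right) \\
&= -\sigma^2\, \mathcal{E}\left( \sum_{i=1}^{N} \left( \frac{1}{\|Y\|^2} - \frac{2 y_i^2}{\|Y\|^4} \right) \right) \\
&= -\sigma^2\, \mathcal{E}\left( \frac{N-2}{\|Y\|^2} \right).
\end{aligned}
$$

Using this equality in the MSE expression, we finally obtain

$$
\begin{aligned}
\mathrm{MSE}_{JS} &= N\sigma^2 + \mathcal{E}\left( \frac{(N-2)^2\sigma^4}{\|Y\|^2} - \frac{2(N-2)^2\sigma^4}{\|Y\|^2} \right) \\
&= N\sigma^2 - (N-2)^2\sigma^4\, \mathcal{E}\left( \frac{1}{\|Y\|^2} \right) < N\sigma^2.
\end{aligned}
$$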

References 


1. Berger JO (1980) A robust generalized Bayes estimator and confidence region for a multivariate
   normal mean. Ann Stat 8:716-761
2. Berger JO (1982) Selecting a minimax estimator of a multivariate normal mean. Ann Stat
   10:81-92
3. Berger JO (1994) An overview of robust Bayesian analysis. Test 3:5-124
4. Bertero M (1989) Linear inverse and ill-posed problems. Adv Electron Electron Phys 75:1-120
5. Bhattacharya PK (1966) Estimating the mean of a multivariate normal population with general
   quadratic loss function. Ann Math Stat 37:1819-1824
6. Bock ME (1975) Minimax estimators of the mean of a multivariate normal distribution. Ann
   Stat 3:209-218
7. Brandwein AC, Strawderman WE (2012) Stein estimation for spherically symmetric
   distributions: recent developments. Stat Sci 27:11-23
8. Casella G (1980) Minimax ridge regression estimation. Ann Stat 8:1036-1056
9. Casella G, Berger R (2001) Statistical inference. Cengage Learning
10. Efron B, Morris C (1973) Stein’s estimation rule and its competitors - an empirical Bayes
    approach. J Am Stat Assoc 68(341):117-130
11. Greenberg E, Webster CE (1983) Advanced econometrics: a bridge to the literature. Wiley
12. Guo YY, Pal N (1992) A sequence of improvements over the James-Stein estimator. J Multivar
    Anal 42:302-317
13. Hoerl AE (1962) Application of ridge analysis to regression problems. Chem Eng Prog 58:54-59
14. Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal
    problems. Technometrics 12:55-67
15. James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the 4th Berkeley
    symposium on mathematical statistics and probability, vol I. University of California Press,
    pp 361-379
16. Kubokawa T (1991) An approach to improving the James-Stein estimator. J Multivar Anal
    36:121-126
17. Ljung L (1999) System identification - theory for the user, 2nd edn. Prentice-Hall, Upper Saddle
    River
18. Manton JH, Krishnamurthy V, Poor HV (1998) James-Stein state filtering algorithms. IEEE
    Trans Signal Process 46(9):2431-2447
19. Maruyama Y, Strawderman WE (2005) A new class of generalized Bayes minimax ridge
    regression estimators. Ann Stat 1753-1770
20. Phillips DL (1962) A technique for the numerical solution of certain integral equations of the
    first kind. J Assoc Comput Mach 9:84-97
21. Pillonetto G, Dinuzzo F, Chen T, De Nicolao G, Ljung L (2014) Kernel methods in system
    identification, machine learning and function estimation: a survey. Automatica 50
22. Shinozaki N (1974) A note on estimating the mean vector of a multivariate normal distribution
    with general quadratic loss function. Keio Eng Rep 27:105-112
23. Stein C (1956) Inadmissibility of the usual estimator for the mean of a multivariate distribution.
    In: Proceedings of the 3rd Berkeley symposium on mathematical statistics and probability, vol
    I. University of California Press, pp 197-206
24. Stein C (1981) Estimation of the mean of a multivariate normal distribution. Ann Stat
    9:1135-1151
25. Strawderman WE (1971) Proper Bayes minimax estimators of the multivariate normal mean.
    Ann Math Stat 42:385-388
26. Strawderman WE (1978) Minimax adaptive generalized ridge regression estimators. J Am Stat
    Assoc 73:623-627
27. Tikhonov AN, Arsenin VY (1977) Solutions of ill-posed problems. Winston/Wiley,
    Washington, D.C.
28. Tikhonov AN (1943) On the stability of inverse problems. Dokl Akad Nauk SSSR 39:195-198
29. Tikhonov AN (1963) On the solution of incorrectly formulated problems and the regularization
    method. Dokl Akad Nauk SSSR 151:501-504


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Chapter 2
Classical System Identification


Abstract System identification as a field has been around since the 1950s, with
roots in statistical theory. A substantial body of concepts, theory, algorithms and
experience has been developed since then. Indeed, there is a very extensive literature
on the subject, with many textbooks, like [5, 8, 12]. Some main points of this
“classical” field are summarized in this chapter, just pointing to the basic structure of
the problem area. The problem centres around four main pillars: (1) the observed data
from the system, (2) a parametrized set of candidate models, “the model structure”,
(3) an estimation method that fits the model parameters to the observed data and (4)
a validation process that helps in making decisions about the choice of model structure.
The crucial choice is that of the model structure. The archetypical choice for linear 
models is the ARX model, a linear difference equation between the system’s input and 
output signals. This is a universal approximator for linear systems—for sufficiently 
high orders of the equations, arbitrarily good descriptions of the system are obtained. 
For a “good” model, proper choices of structural parameters, like the equation orders, 
are required. An essential part of the classical theory deals with asymptotic quality 
measures, bias and variance, that aim at giving the best mean square error between 
the model and the true system. Some of this theory is reviewed in this chapter for 
estimation methods of the maximum likelihood character. 


2.1 The State-of-the-Art Identification Setup 


System Identification is characterized by five basic concepts: 


• $\mathcal{X}$: the experimental conditions under which the data is generated;
• $\mathcal{D}$: the data;
• $\mathcal{M}$: the model structure and its parameters $\theta$;
• $\mathcal{I}$: the identification method by which a parameter value $\hat{\theta}$ in the model structure
  $\mathcal{M}(\theta)$ is determined based on the data $\mathcal{D}$;
• $\mathcal{V}$: the validation process that generates confidence in the identified model.

Fig. 2.1 The identification work loop


See Fig. 2.1. Navigating to a model that passes the validation test (“is not falsified”)
is typically an iterative process, involving revisions of the necessary choices.
For several of the steps in this loop, helpful support tools have been developed.
It is however not quite possible or desirable to fully automate the choices, since 
subjective perspectives related to the intended use of the model are very important. 


2.2 $\mathcal{M}$: Model Structures


A model structure $\mathcal{M}$ is a set of parametrized models that describe the relations
between the inputs u and outputs y of the system. The parameters are denoted by $\theta$,
so a particular model will be denoted by $\mathcal{M}(\theta)$. The set of models is then


$$\mathcal{M} = \{\mathcal{M}(\theta) \mid \theta \in D_{\mathcal{M}}\}. \tag{2.1}$$


The models may be expressed and formalized in many different ways. The most
common model is linear and time-invariant (LTI), but possible models include
both nonlinear and time-varying cases, so a list of the concrete models actually used would
be both very long and diverse.

It is useful to take the general view that a model gives a rule to predict (one-step-ahead)
the output at time t, i.e., y(t) (a p-dimensional column vector), based
on observations of previous input-output data up to time t − 1, denoted by
$Z^{t-1} = \{y(t-1), u(t-1), y(t-2), u(t-2), \ldots\}$. Here u(t) is the input at time t and we
assume that the data are collected in discrete time and denote for simplicity the
samples as enumerated by t.

The predicted output will then be


$$\hat{y}(t|\theta) = g(t, \theta, Z^{t-1}) \tag{2.2}$$

for a certain function g of past data. This covers a very wide variety of model descriptions,
sometimes in a somewhat abstract way. The descriptions become much more
explicit when we specialize to linear models.


A note on “inputs” All measurable disturbances that affect y should be included 
among the inputs u to the system, even if they cannot be manipulated as control 
inputs. In some cases, the system may entirely lack measurable inputs, so the model 
(2.2) then just describes how future outputs can be predicted from past ones. Such 
models are called time series, and correspond to systems that are driven by unob- 
servable disturbances. Most of the techniques described in this book apply also to 
such models. 


A note on disturbances A complete model involves both a description of the input- 
output relations and a description of how various disturbance or noise sources affect 
the measurements. The noise description is essential both to understand the quality of 
the model predictions and the model uncertainty. Proper control design also requires 
a picture of the disturbances in the system. 


2.2.1 Linear Time-Invariant Models 


For linear time-invariant (LTI) systems, a general model structure is given by the 
transfer function G from input u to output y and with an additive disturbance—or 
noise—v(t): 


$$y(t) = G(q, \theta)u(t) + v(t). \tag{2.3a}$$


This model is in discrete time and q denotes the shift operator, $qy(t) = y(t+1)$.
The sampling interval is set to one time unit. The expansion of $G(q, \theta)$ in the inverse
(backwards) shift operator gives the impulse response of the system:


$$G(q, \theta)u(t) = \sum_{k=1}^{\infty} g_k(\theta) q^{-k} u(t) = \sum_{k=1}^{\infty} g_k(\theta) u(t-k). \tag{2.3b}$$


The discrete-time Fourier transform (or the z-transform of the impulse response,
evaluated at $z = e^{i\omega}$) gives the frequency response of the system:


$$G(e^{i\omega}, \theta) = \sum_{k=1}^{\infty} g_k(\theta) e^{-ik\omega}. \tag{2.3c}$$


The function G describes how an input sinusoid shifts phase and amplitude when it
passes through the system.

The additive noise term v can be described as white noise e(t), filtered through
another transfer function H:


$$v(t) = H(q, \theta)e(t) \tag{2.3d}$$
$$\mathscr{E} e^2(t) = \sigma^2 \tag{2.3e}$$
$$\mathscr{E} e(t)e^T(k) = 0 \ \text{ if } k \neq t \tag{2.3f}$$


($\mathscr{E}$ denotes mathematical expectation).

This noise characterization is quite versatile and with a suitable choice of H it 
can describe a disturbance with a quite arbitrary spectrum. It is useful to normalize 
(2.3d) by making H monic: 


$$H(q, \theta) = 1 + h_1(\theta) q^{-1} + \cdots. \tag{2.3g}$$


To think in terms of the general model description (2.2) with the predictor as a 
unifying model concept, assuming H to be inversely stable [5, Sect. 3.2] it is useful 
to rewrite (2.3) as 


$$H^{-1}(q, \theta)y(t) = H^{-1}(q, \theta)G(q, \theta)u(t) + e(t)$$
$$y(t) = [1 - H^{-1}(q, \theta)]y(t) + H^{-1}(q, \theta)G(q, \theta)u(t) + e(t)$$
$$y(t) = G(q, \theta)u(t) + [1 - H^{-1}(q, \theta)][y(t) - G(q, \theta)u(t)] + e(t).$$

Note that the expansion of $H^{-1}$ starts with “1”, so the expansion of $1 - H^{-1}$ starts with a
term in $q^{-1}$, i.e., it contains a delay. That means that the right-hand side is known at time t − 1
except for the term e(t), which is unpredictable at time t − 1 and must be estimated
by its mean 0. All this means that the predictor for (2.3) (the conditional mean of
y(t) given past data) is


$$\hat{y}(t|\theta) = G(q, \theta)u(t) + [1 - H^{-1}(q, \theta)][y(t) - G(q, \theta)u(t)]. \tag{2.4}$$


It is easy to interpret the first term as a simulation using the input u, adjusted with a 
prediction of the additive disturbance v(t) at time ¢, based on past values of v. The 
predictor is thus an easy reformulation of the basic transfer functions G and H. The 
question now is how to parametrize these. 
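
As a small worked illustration of (2.4), consider the first-order filters below. They are chosen here purely for illustration (the symbols a and b are assumptions, not taken from the text), and the steps simply substitute them into the predictor formula.

```latex
% Illustrative first-order choice (an assumption, not from the text):
G(q,\theta) = \frac{b\,q^{-1}}{1 + a\,q^{-1}}, \qquad
H(q,\theta) = \frac{1}{1 + a\,q^{-1}}, \qquad \theta = [a,\ b]

% Inverse noise model and the factor that multiplies the past data:
H^{-1}(q,\theta) = 1 + a\,q^{-1}, \qquad 1 - H^{-1}(q,\theta) = -a\,q^{-1}

% Substitute into (2.4):
\hat{y}(t|\theta) = G(q,\theta)u(t) - a\,q^{-1}\,[\,y(t) - G(q,\theta)u(t)\,]
                  = (1 + a\,q^{-1})\,G(q,\theta)u(t) - a\,y(t-1)
                  = b\,u(t-1) - a\,y(t-1)
```

The result depends only on u(t − 1) and y(t − 1), i.e., on data available at time t − 1, as the general argument requires.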


2.2.1.1 The McMillan Degree 


Given just the sequence of impulse responses $g_k$, with k = 1, 2, ..., one may consider
different ways of representing the system in a more compact form, like rational
transfer functions or state-space models, to be considered below. A quite useful
concept is then the McMillan degree:

From a given impulse response sequence $g_k$ (that could be p × m matrices that
describe a system with m inputs and p outputs) form the Hankel matrix


$$H_k = \begin{bmatrix} g_1 & g_2 & g_3 & \cdots & g_k \\ g_2 & g_3 & g_4 & \cdots & g_{k+1} \\ \vdots & \vdots & \vdots & & \vdots \\ g_k & g_{k+1} & g_{k+2} & \cdots & g_{2k-1} \end{bmatrix}. \tag{2.5}$$


Then as k increases, the McMillan degree n of the impulse response is the maximal
rank of $H_k$:


$$n = \max_k \operatorname{rank} H_k. \tag{2.6}$$


This means that the impulse response can be generated from an nth-order state-space 
model, but not from any lower-order model. 
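
The rank test in (2.5) and (2.6) can be sketched numerically as follows. This is a minimal illustration for a scalar impulse response; the helper names `hankel_matrix` and `mcmillan_degree`, the tolerance, and the two-mode example response are assumptions made for this sketch. With finite data, only the Hankel matrices that fit in the available samples can be examined.

```python
import numpy as np

def hankel_matrix(g, k):
    """Build the k-by-k Hankel matrix H_k of (2.5) from a scalar impulse
    response stored as g = [g_1, g_2, ...] (so g[0] holds g_1)."""
    return np.array([[g[i + j] for j in range(k)] for i in range(k)])

def mcmillan_degree(g, tol=1e-8):
    """Numerical version of (2.6): the largest rank of H_k over all k that
    the available samples allow (H_k needs g_1, ..., g_{2k-1})."""
    k_max = (len(g) + 1) // 2
    return max(np.linalg.matrix_rank(hankel_matrix(g, k), tol=tol)
               for k in range(1, k_max + 1))

# Hypothetical impulse response: the sum of two exponential modes, which a
# second-order state-space model can generate.
k = np.arange(1, 20)                 # k = 1, ..., 19
g = 0.9 ** k + (-0.5) ** k
degree = mcmillan_degree(g)
print(degree)
```

In noisy data the exact rank is replaced by a judgement on the singular values of the Hankel matrix, which is why a tolerance is exposed here.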


2.2.1.2 Black-Box Models 


A black-box model uses no physical insight or interpretation, but is just a general 
and flexible parameterization. It is natural to let G and H be rational in the shift 
operator: 


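$$G(q, \theta) = \frac{B(q)}{F(q)}, \qquad H(q, \theta) = \frac{C(q)}{D(q)} \tag{2.7a}$$
$$B(q) = b_1 q^{-1} + b_2 q^{-2} + \ldots + b_{n_b} q^{-n_b} \tag{2.7b}$$
$$F(q) = 1 + f_1 q^{-1} + \ldots + f_{n_f} q^{-n_f}, \tag{2.7c}$$


with C and D monic like F, i.e., starting with a “1”, and with the vector $\theta$ collecting all
the coefficients:


$$\theta = [b_1, b_2, \ldots, f_{n_f}]. \tag{2.7d}$$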


Common black-box structures of this kind are FIR (finite impulse response model, 
F = C = D = 1), ARMAX (autoregressive moving average with exogenous input, 
F = D), and BJ (Box—Jenkins, all four polynomials different). 


A Very Common Case: The ARX Model 
A very common case is that F = D = A and C = 1, which gives the ARX model
(autoregressive with exogenous input):

$$y(t) = A^{-1}(q)B(q)u(t) + A^{-1}(q)e(t) \quad \text{or} \tag{2.8a}$$
$$A(q)y(t) = B(q)u(t) + e(t) \quad \text{or} \tag{2.8b}$$
$$y(t) + a_1 y(t-1) + \ldots + a_{n_a} y(t-n_a) \tag{2.8c}$$
$$\qquad = b_1 u(t-1) + \ldots + b_{n_b} u(t-n_b) + e(t). \tag{2.8d}$$


This means that the expression for the predictor (2.4) becomes very simple: 


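$$\hat{y}(t|\theta) = \varphi^T(t)\theta \tag{2.9}$$
$$\varphi^T(t) = [-y(t-1)\ \ -y(t-2)\ \ \ldots\ \ -y(t-n_a)\ \ \ u(t-1)\ \ \ldots\ \ u(t-n_b)] \tag{2.10}$$
$$\theta^T = [a_1\ \ a_2\ \ \ldots\ \ a_{n_a}\ \ b_1\ \ b_2\ \ \ldots\ \ b_{n_b}]. \tag{2.11}$$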


In statistics, such a model is known as a linear regression. 
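
A minimal sketch of how the linear regression form (2.9)-(2.11) is used in practice: simulate a first-order ARX system, stack the regressors, and solve the least squares problem. The system coefficients, noise level, and variable names are assumptions chosen for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first-order ARX system: y(t) + a1*y(t-1) = b1*u(t-1) + e(t)
a1_true, b1_true = -0.7, 0.5
N = 2000
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a1_true * y[t - 1] + b1_true * u[t - 1] + e[t]

# Regressor rows phi^T(t) = [-y(t-1), u(t-1)] as in (2.10), for t = 1, ..., N-1
Phi = np.column_stack([-y[:-1], u[:-1]])
Y = y[1:]

# Linear least squares estimate of theta = [a1, b1]
theta_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

# One-step-ahead predictions (2.9) on the estimation data
y_pred = Phi @ theta_hat
print(theta_hat)
```

Because the predictor is linear in the parameters, no iterative search is needed; this is the property that makes high-order ARX models attractive as a first step.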

We note that as $n_a$ and $n_b$ increase to infinity the predictor (2.9) may approximate
any linear model predictor (2.4). This points to a very important general approximation
property of ARX models:


Theorem 2.1 (based on [6]) Suppose a true linear system is given by

$$y(t) = G_0(q)u(t) + H_0(q)e(t), \tag{2.12}$$

where $G_0(q)$ and $H_0(q)$ are stable filters,

$$G_0(q) = \sum_{k=1}^{\infty} g_k q^{-k}, \qquad H_0(q) = 1 + \sum_{k=1}^{\infty} h_k q^{-k}, \qquad d(n) = \sum_{k=n}^{\infty} |g_k| + |h_k|,$$

and e is a sequence of independent zero-mean random variables with bounded fourth-order
moments.

Consider an ARX model (2.8) with orders $n_a = n_b = n$, estimated from N observations.
Assume that the order n depends on the number of data as n(N), and
tends to infinity such that $n(N)^5/N \to 0$. Assume also that the system is such that
$d(n(N))\sqrt{N} \to 0$ as $N \to \infty$. Then the ARX model estimates $\hat{A}_{n(N)}(q)$ and $\hat{B}_{n(N)}(q)$
of order n(N) obey


$$\frac{\hat{B}_{n(N)}(q)}{\hat{A}_{n(N)}(q)} \to G_0(q), \qquad \frac{1}{\hat{A}_{n(N)}(q)} \to H_0(q) \quad \text{as } N \to \infty. \tag{2.13}$$


Intuitively, the above result follows from the fact that the true predictor for the system

$$\hat{y}(t|\theta) = (1 - H_0^{-1})y(t) + H_0^{-1}G_0 u(t) = \sum_{k=1}^{\infty} \tilde{h}_k y(t-k) + \tilde{g}_k u(t-k)$$

is stable. Hence, it can be truncated at some n with arbitrary accuracy, and the truncated
sum is the predictor of an nth-order ARX model.

This is quite a useful result saying that ARX models can approximate any linear 
system, if the orders are sufficiently large. ARX models are easy to estimate. The 
estimates are calculated by linear least squares (LS) techniques, which are convex 
and numerically robust. Estimating a high-order ARX model, possibly followed by 
some model order reduction, could thus be an alternative to the numerically more 
demanding general PEM criterion minimization (2.22) introduced later on. This has 
been extensively used, e.g., by [14, 15]. The only drawback with high-order ARX 
models is that they may suffer from high variance. 


2.2.1.3 Grey-Box Models 


If some physical facts are known about the system, these can be incorporated in
the model structure. Such a model, which is based on physical insight and has a built-in
behaviour that mimics known physics, is known as a Grey-Box Model. For example,
for an airplane the motion equations are known from Newton’s laws,
but certain parameters, like the aerodynamical derivatives, are unknown. Then it is
natural to build a continuous-time state-space model from known physical equations:


$$\begin{aligned} \dot{x}(t) &= A(\theta)x(t) + B(\theta)u(t) \\ y(t) &= C(\theta)x(t) + D(\theta)u(t) + v(t). \end{aligned} \tag{2.14}$$

Here $\theta$ are simply some entries of the matrices A, B, C, D, corresponding to
unknown physical parameters, while the other matrix entries signify known physical
behaviour. This model can be sampled with well-known sampling formulas (obeying
the input inter-sample properties, zero-order hold or first-order hold) to give


$$\begin{aligned} x(t+1) &= F(\theta)x(t) + G(\theta)u(t) \\ y(t) &= C(\theta)x(t) + D(\theta)u(t) + w(t). \end{aligned} \tag{2.15}$$

The model (2.15) has the transfer function from u to y

$$G(q, \theta) = C(\theta)[qI - F(\theta)]^{-1}G(\theta) + D(\theta), \tag{2.16}$$


so we have achieved a particular parameterization of the general linear model (2.3).


2.2.1.4 Continuous-Time Models


The general model description (2.2) describes how the predictions evolve in discrete
time. In many cases, however, we are interested in continuous-time (CT) models, e.g.,
for physical interpretation and simulation. CT model estimation is nevertheless contained
in the described framework, as the linear state-space model (2.14) illustrates.
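
The sampling step from (2.14) to (2.15) can be sketched for a scalar example. The model below, $\dot{x} = -\theta x + \theta u$ (a first-order lag with unit static gain), the sampling interval, and the function name `sample_first_order` are assumptions for illustration; the closed-form expressions are the standard zero-order-hold formulas for the scalar case, $F = e^{aT}$ and $G = (e^{aT} - 1)b/a$.

```python
import numpy as np

def sample_first_order(theta, T=1.0):
    """Zero-order-hold sampling of the hypothetical scalar CT model
    dx/dt = a*x + b*u with a = -theta, b = theta."""
    a, b = -theta, theta
    F = np.exp(a * T)            # discrete-time state transition
    G = (F - 1.0) / a * b        # discrete-time input gain under ZOH
    return F, G

theta, T = 2.0, 0.1
F, G = sample_first_order(theta, T)

# Unit-step response of the sampled model x(t+1) = F x(t) + G u(t) ...
x, xs = 0.0, []
for _ in range(50):
    xs.append(x)
    x = F * x + G * 1.0
xs = np.array(xs)

# ... coincides at the sampling instants with the CT step response
exact = 1.0 - np.exp(-theta * T * np.arange(50))
print(np.max(np.abs(xs - exact)))
```

The discrete-time model is parametrized by the same physical parameter $\theta$, so a prediction error criterion evaluated on sampled data still estimates the continuous-time quantity.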


2.2.2 Nonlinear Models 


A nonlinear model is a relation (2.2), where the function g is nonlinear in the
input-output data Z. There is a rich variation in how to specify the function g more explicitly.
A quite general way is the nonlinear state-space equation, which is a counterpart to
(2.15):


$$\begin{aligned} x(t+1) &= f(x(t), u(t), v(t), \theta) \\ y(t) &= h(x(t), e(t), \theta), \end{aligned} \tag{2.17}$$


where v and e are white noises. 


2.3 $\mathcal{I}$: Identification Methods – Criteria


The goal of identification is to match the model to the data. Here the basic techniques 
for such matching will be discussed. Suppose we have collected a data record in the 
time domain 


$$\mathcal{D}_T = \{u(1), y(1), \ldots, u(N), y(N)\}, \tag{2.18}$$

which in this book will be called the identification set or training set, with N being its
size. A natural way to evaluate a model is to see how well it is able to predict the
measured output, since the model is in essence a predictor. It is thus quite natural to
form the prediction errors for (2.2):


$$\varepsilon(t, \theta) = y(t) - \hat{y}(t|\theta). \tag{2.19}$$

The “size” of this error can be measured by some scalar norm:

$$\ell(\varepsilon(t, \theta)) \tag{2.20}$$


and the performance of the predictor over the whole data record $\mathcal{D}_T$ is given by

$$V_N(\theta) = \sum_{t=1}^{N} \ell(\varepsilon(t, \theta)). \tag{2.21}$$

A natural parameter estimate is the value that minimizes this prediction fit:

$$\hat{\theta}_N = \arg\min_{\theta \in D_{\mathcal{M}}} V_N(\theta). \tag{2.22}$$


This is the Prediction Error Method (PEM) and it is applicable to general model 
structures. See, e.g., [5] or [7] for more details. 
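
The steps (2.19)-(2.22) can be sketched for a model whose predictor is not linear in the parameters. The example below assumes a hypothetical first-order output-error model ($G = b q^{-1}/(1 + f q^{-1})$, $H = 1$), a quadratic norm, and a crude grid search standing in for the numerical optimizer that would normally be used; all names and numbers are illustrative.

```python
import numpy as np

def predict_oe(theta, u):
    """Predictor of a hypothetical first-order output-error model.
    With H = 1, (2.4) reduces to y_hat(t) = -f*y_hat(t-1) + b*u(t-1)."""
    f, b = theta
    y_hat = np.zeros(len(u))
    for t in range(1, len(u)):
        y_hat[t] = -f * y_hat[t - 1] + b * u[t - 1]
    return y_hat

def V_N(theta, u, y):
    """Criterion (2.21) with the quadratic norm l(eps) = eps**2."""
    eps = y - predict_oe(theta, u)          # prediction errors (2.19)
    return np.sum(eps ** 2)

rng = np.random.default_rng(1)
N = 500
u = rng.standard_normal(N)
theta_true = (-0.8, 1.0)                    # (f, b), assumed for the demo
y = predict_oe(theta_true, u) + 0.05 * rng.standard_normal(N)

# Grid search over (f, b) as a stand-in for the minimization in (2.22)
f_grid = np.linspace(-0.95, 0.95, 39)
b_grid = np.linspace(0.0, 2.0, 41)
costs = np.array([[V_N((f, b), u, y) for b in b_grid] for f in f_grid])
i, j = np.unravel_index(np.argmin(costs), costs.shape)
theta_hat = (f_grid[i], b_grid[j])
print(theta_hat)
```

Unlike the ARX case, the criterion here is not quadratic in $\theta$, which is why a search is needed; practical implementations typically use gradient-based iterations instead of a grid.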

The PEM approach can be embedded in a statistical setting. The ML methodology 
below offers a systematic framework to do so. 


2.3.1 A Maximum Likelihood (ML) View 


If the system innovations e have a probability density function (pdf) f(x), then
the criterion function (2.21) with $\ell(x) = -\log f(x)$ will be the negative logarithm of the
likelihood function. See Lemma 5.1 in [5]. More specifically, let the system have p
outputs, and let the innovations be Gaussian with zero mean and covariance matrix
$\Lambda$, so that


$$y(t) = \hat{y}(t|\theta) + e(t), \qquad e(t) \in N(0, \Lambda) \tag{2.23}$$


for the $\theta$ that generated the data. Then it follows that the negative logarithm of the
likelihood function for estimating $\theta$ from y is


$$L_N(\theta) = \frac{1}{2}\left[V_N(\theta) + N \log\det\Lambda + Np \log 2\pi\right], \tag{2.24}$$

where $V_N(\theta)$ is defined by (2.21), with

$$\ell(\varepsilon(t, \theta)) = \varepsilon^T(t, \theta)\Lambda^{-1}\varepsilon(t, \theta). \tag{2.25}$$


That means that the maximum likelihood estimate (MLE) for known $\Lambda$ is
obtained by minimizing $V_N(\theta)$. If $\Lambda$ is not known, it can be included among the
parameters and estimated ([5], p. 218), which results in the criterion


$$D_N(\theta) = \det \sum_{t=1}^{N} \varepsilon(t, \theta)\varepsilon^T(t, \theta) \tag{2.26}$$


to be minimized.

A Bayesian interpretation of (2.22) as well as a regularized version will be given
in Chap. 4.


2.4 Asymptotic Properties of the Estimated Models


As we have seen in the first chapter, bias and variance play important roles in estima- 
tion problems. We will here give a short account of how these concepts are treated 
in classical system identification. 


2.4.1 Bias and Variance 


The observations, in particular those of the output from the system, are affected by noise and
disturbances. That means that the estimated model parameters (2.22) will also be
affected by disturbances. These disturbances are typically described as stochastic
processes, which makes the estimate $\hat{\theta}_N$ a random variable. It has a certain probability
density function, which could be complicated to compute. Often the analysis
is restricted to its mean and variance only. The difference between the mean and a
true description of the system measures the bias of the model. If the mean coincides
with the true system, the estimate is said to be unbiased. As already pointed out in
(1.1), the total error in a model thus has two contributions: the bias and the variance.


2.4.2 Properties of the PEM Estimate as N → ∞


Except in simple special cases it is quite difficult to compute the pdf of the estimate
$\hat{\theta}_N$. However, its asymptotic properties as N → ∞ are easier to establish. The basic
results can be summarized as follows (see [5, Chaps. 8 and 9] for a more complete
treatment):


• Limit model:

$$\hat{\theta}_N \to \theta^* = \arg\min_{\theta}\left[\lim_{N\to\infty} \frac{1}{N}V_N(\theta) = \mathscr{E}\,\ell(\varepsilon(t, \theta))\right]. \tag{2.27}$$

Here $\mathscr{E}$ denotes mathematical expectation. So the estimate will converge to the
best possible model, in the sense that it gives the smallest average prediction error.
• Asymptotic covariance matrix for scalar output models:

In case the prediction errors $e(t) = \varepsilon(t, \theta^*)$ for the limit model are approximately
white, the covariance matrix of the parameters is asymptotically given by


$$\mathrm{Cov}\,\hat{\theta}_N \sim \frac{\kappa(\ell)}{N}\left[\mathrm{Cov}\,\frac{d}{d\theta}\hat{y}(t|\theta)\right]^{-1}. \tag{2.28}$$


That means that the covariance matrix of the parameter estimate is given by the
inverse covariance matrix of the gradient of the predictor w.r.t. the parameters.
Here (prime denoting derivatives)


$$\kappa(\ell) = \frac{\mathscr{E}\,[\ell'(e(t))]^2}{[\mathscr{E}\,\ell''(e(t))]^2}. \tag{2.29}$$


Note that

$$\kappa(\ell) = \sigma^2 = \mathscr{E} e^2(t) \quad \text{if } \ell(e) = e^2/2.$$


If the model structure contains the true system, it can be shown that this covariance
matrix is the smallest that can be achieved by any unbiased estimate, in case the
norm $\ell$ is chosen as the negative logarithm of the pdf of e. That is, it attains the
Cramér-Rao lower bound [2]. These results are valid for quite general model structures.
• Results for LTI models:

Now, specialize to linear models (2.3) and assume that the true system is described
by


$$y(t) = G_0(q)u(t) + H_0(q)e(t), \tag{2.30}$$


where $G_0$ and $H_0$ could be general transfer functions, possibly much more complicated than
the model. Then


$$\theta^* = \arg\min_{\theta} \int_{-\pi}^{\pi} \left|G(e^{i\omega}, \theta) - G_0(e^{i\omega})\right|^2 \frac{\Phi_u(\omega)}{|H(e^{i\omega}, \theta)|^2}\, d\omega. \tag{2.31}$$


That is, the frequency function of the limiting model will approximate the true
frequency function as well as possible in a frequency norm given by the input
spectrum $\Phi_u$ and the noise model.

– For a linear black-box model, the covariance of the estimated frequency function
is


$$\mathrm{Cov}\,G(e^{i\omega}, \hat{\theta}_N) \sim \frac{n}{N}\,\frac{\Phi_v(\omega)}{\Phi_u(\omega)} \quad \text{as } n, N \to \infty, \tag{2.32}$$


where n is the model order and $\Phi_v$ is the noise spectrum $\sigma^2 |H_0(e^{i\omega})|^2$. The
variance of the estimated frequency function at a given frequency is thus, for
a high-order model, proportional to the noise-to-signal ratio at that frequency.
That is a natural and intuitive result.


2.4.3 Trade-Off Between Bias and Variance


The quality of the model depends on the quality of the measured data and the flexi- 
bility of the chosen model structure (2.1). A more flexible model structure typically 
has smaller bias, since it is easier to come closer to the true system. At the same 
time, it will have a higher variance: with higher flexibility it is easier to be fooled by 
disturbances and this may lead to data overfitting. So the trade-off between bias and 
variance to reach a small total error is a choice of balanced flexibility of the model 
structure. 

As the model gets more flexible, the fit to the estimation data in (2.22), given
by $V_N(\hat{\theta}_N)$, will always improve. To account for the variance contribution, it is thus
necessary to modify this fit to assess the total quality of the model. A much used
technique for this is Akaike’s criterion (AIC), [1],


$$\hat{\theta}_N = \arg\min_{\mathcal{M},\, \theta \in D_{\mathcal{M}}} 2L_N(\theta) + 2\dim\theta, \tag{2.33}$$


where $L_N$ is the negative log likelihood function. The minimization also takes place
over a family of model structures with different numbers of parameters ($\dim\theta$).

For Gaussian innovations e with unknown and estimated variance, the criterion
AIC takes the form


$$\hat{\theta}_N = \arg\min_{\mathcal{M},\, \theta \in D_{\mathcal{M}}} \left[\log\det\left(\frac{1}{N}\sum_{t=1}^{N}\varepsilon(t, \theta)\varepsilon^T(t, \theta)\right) + 2\frac{m}{N}\right], \tag{2.34}$$


with $m = \dim\theta$ and after normalization and omission of model-independent quantities.

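There is also a small-sample version, described in [4] and known in the literature
as the corrected Akaike’s criterion (AICc), defined by


$$\hat{\theta}_N = \arg\min_{\mathcal{M},\, \theta \in D_{\mathcal{M}}} \left[\log\det\left(\frac{1}{N}\sum_{t=1}^{N}\varepsilon(t, \theta)\varepsilon^T(t, \theta)\right) + 2\frac{m}{N - m - 1}\right]. \tag{2.35}$$

Another variant places a larger penalty on the model flexibility:

$$\hat{\theta}_N = \arg\min_{\mathcal{M},\, \theta \in D_{\mathcal{M}}} \left[\log\det\left(\frac{1}{N}\sum_{t=1}^{N}\varepsilon(t, \theta)\varepsilon^T(t, \theta)\right) + \log(N)\frac{m}{N}\right]. \tag{2.36}$$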


This is known as Bayesian information criterion (BIC) or Rissanen’s Minimum 
Description Length (MDL) criterion, see, e.g., [10, 11] and [5, pp. 505-507]. 
Section 2.6 contains further aspects on the choice of model structure. 
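
A minimal sketch of order selection with the criteria (2.34) and (2.36) for scalar ARX models. The second-order test system, the candidate orders, and the helper `arx_fit_loss` are assumptions made for this illustration; all orders are fitted on the same samples so that the fit terms are directly comparable.

```python
import numpy as np

def arx_fit_loss(y, u, n, t0):
    """LS fit of an ARX model with n_a = n_b = n on samples t = t0, ..., N-1.
    Returns the mean squared prediction error (the scalar version of the
    matrix inside log det in (2.34))."""
    Phi = np.array([np.concatenate([-y[t - n:t][::-1], u[t - n:t][::-1]])
                    for t in range(t0, len(y))])
    Y = y[t0:]
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return np.mean((Y - Phi @ theta) ** 2)

# Hypothetical second-order ARX system used to generate data
rng = np.random.default_rng(2)
N = 1000
u = rng.standard_normal(N)
e = 0.5 * rng.standard_normal(N)
y = np.zeros(N)
for t in range(2, N):
    y[t] = 1.5 * y[t - 1] - 0.7 * y[t - 2] + u[t - 1] + 0.5 * u[t - 2] + e[t]

orders = range(1, 9)
t0 = max(orders)                  # same estimation samples for every order
N_eff = N - t0
loss = {n: arx_fit_loss(y, u, n, t0) for n in orders}

# m = dim(theta) = 2n parameters for n_a = n_b = n
aic = {n: np.log(loss[n]) + 2 * (2 * n) / N_eff for n in orders}
bic = {n: np.log(loss[n]) + np.log(N_eff) * (2 * n) / N_eff for n in orders}
n_aic = min(aic, key=aic.get)
n_bic = min(bic, key=bic.get)
print(n_aic, n_bic)
```

The plain fit `loss[n]` can only decrease with n, so it cannot be used for selection on its own; the penalty terms are what make a finite order preferable. Since the BIC penalty is the larger one here, it never selects a higher order than AIC.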


2.5 $\mathcal{X}$: Experiment Design


Experiment design involves all questions that concern the collection of estimation 
data, such as selecting which signals to measure, which sampling rate to use, and 
also the design of the input including possible feedback configurations. 

The theory of experiment design primarily relies upon analysis of how the asymp- 
totic parameter covariance matrix (2.28) depends on the design variables: so the 
essence of experiment design can be symbolized as 


$$\min_{\mathcal{X}} \operatorname{trace}\left\{C\left[\mathscr{E}\psi(t)\psi^T(t)\right]^{-1}\right\},$$


where $\psi$ is the gradient of the prediction w.r.t. the parameters and the matrix C is
used to weight variables reflecting the intended use of the model.

For linear systems, the input design is often expressed as selecting the spectrum 
(frequency contents) of u. 

This leads to the following recipe: let the input’s power be concentrated to fre- 
quency regions where a good model fit is essential, and where disturbances are 
dominating. 

The measurement setup also belongs to the important experiment design questions,
e.g., whether band-limited inputs are used to estimate continuous-time models and how
the experimental equipment is instrumented with band-pass filters, see, e.g.,
[8, Sects. 13.2–3].


2.6 $\mathcal{V}$: Model Validation


Model validation is about obtaining a model that, at least for the time being, can be 
accepted. It amounts to examining and scrutinizing the model to check if it can be 
used for its purpose. These methods are of course problem dependent and contain
several subjective elements. Therefore, no conclusive procedure for validation can
be given. A few useful techniques will be listed here. Basically it is a matter of trying
to falsify a model under the conditions it will be used for and also to gain confidence 
in its ability to reproduce new data from the system. 


2.6.1 Falsifying Models: Residual Analysis 


An estimated model is never a correct description of a true system. In that sense, a 
model cannot be “validated”, i.e., proved to be correct. Instead it is instructive to try 
and falsify it, i.e., confront it with facts that may contradict its correctness. A good 
principle is to look for the simplest unfalsified model, see, e.g., [9].

Residual analysis is the leading technique for falsifying models: the residuals
or one-step-ahead prediction errors $\hat{\varepsilon}(t) = \varepsilon(t, \hat{\theta}_N) = y(t) - \hat{y}(t|\hat{\theta}_N)$ should ideally
not contain any traces of past inputs or past residuals. If they did, it would mean that the
predictions are not ideal. So, it is natural to test the correlation functions


$$\hat{r}_{\varepsilon,u}(k) = \frac{1}{N}\sum_{t=1}^{N} \hat{\varepsilon}(t+k)u(t) \tag{2.37}$$
$$\hat{r}_{\varepsilon}(k) = \frac{1}{N}\sum_{t=1}^{N} \hat{\varepsilon}(t+k)\hat{\varepsilon}(t) \tag{2.38}$$


and check that they are not larger than certain thresholds. Here N is the length of the 
data record and k typically ranges over a fraction of the interval [—N, N]. See, e.g., 
[5, Sect. 16.6] for more details. 
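
The cross-correlation test (2.37) can be sketched as follows. The residual sequences are synthetic stand-ins for the output of a "good" and a "bad" model, and the threshold is a commonly used approximate 99% bound that assumes white residuals independent of the input; both are assumptions of this illustration, not prescriptions from the text.

```python
import numpy as np

def cross_corr(eps, u, k):
    """Sample version of (2.37) for lag k >= 0, summing over the
    index range where both eps(t+k) and u(t) are available."""
    N = len(eps)
    return np.sum(eps[k:] * u[:N - k]) / N

rng = np.random.default_rng(3)
N = 1000
u = rng.standard_normal(N)

# Residuals of a hypothetical adequate model: white, independent of u
eps_good = rng.standard_normal(N)
# Residuals of a hypothetical deficient model: a leftover trace of u(t-3)
eps_bad = eps_good + 0.5 * np.concatenate([np.zeros(3), u[:-3]])

lags = range(0, 11)
r_good = np.array([cross_corr(eps_good, u, k) for k in lags])
r_bad = np.array([cross_corr(eps_bad, u, k) for k in lags])

# Approximate 99% bound under the whiteness/independence assumption
bound_bad = 2.58 * np.sqrt(np.mean(eps_bad ** 2) * np.mean(u ** 2) / N)
falsified_lags = [k for k in lags if abs(r_bad[k]) > bound_bad]
print(falsified_lags)
```

A clear excursion at lag k indicates that the model has missed part of the influence of u(t − k) on y(t), which points to where the model structure should be revised.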


2.6.2 Comparing Different Models 


When several models have been estimated, the question is how to choose the “best one”.
Models that employ more parameters naturally show a better fit to the data,
and it is necessary to compensate for that. The model selection criteria AIC (2.34) and
BIC (2.36) are examples of how such decisions can be taken. They can be extended
to regular hypothesis tests where more complex models are accepted or rejected at
various test levels, see, e.g., [5, Sect. 16.4].

Making comparisons in the frequency domain is a very useful complement for
domain experts used to thinking in terms of natural frequencies, natural damping, etc.


2.6.3 Cross-Validation 


Cross-validation (CV) is an important statistical concept that loosely means that the
model performance is tested on a data set (validation data) other than the estimation
data. There is an extensive literature on cross-validation, e.g., [13], and many ways
to split up available data into estimation and validation parts have been suggested.
The goal is to obtain an estimate of the model's prediction capability on future data
for different choices of $\theta$. Parameter selection is thus
performed by optimizing the estimated prediction score. Hold-out validation is the
simplest form of CV: the available data are split in two parts, where one of them
(estimation set) is used to estimate the model, and the other one (validation set) is
used to assess the prediction capability. By ensuring independence of the model fit
from the validation data, the estimate of the prediction performance is approximately
unbiased. For models that do not require estimation of initial states, like FIR and ARX
models, CV can be applied efficiently in more sophisticated ways by splitting the
data into more portions, as described in [3].
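
Hold-out validation as described above can be sketched for ARX order selection. The data-generating system, the 400/200 split, and the helper `arx_regressors` are assumptions chosen for this illustration.

```python
import numpy as np

def arx_regressors(y, u, n):
    """Rows phi^T(t) = [-y(t-1..t-n), u(t-1..t-n)] and targets y(t)
    for t = n, ..., len(y)-1."""
    Phi = np.array([np.concatenate([-y[t - n:t][::-1], u[t - n:t][::-1]])
                    for t in range(n, len(y))])
    return Phi, y[n:]

# Hypothetical second-order ARX system
rng = np.random.default_rng(4)
N = 600
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(2, N):
    y[t] = (1.5 * y[t - 1] - 0.7 * y[t - 2] + u[t - 1] + 0.5 * u[t - 2]
            + 0.5 * rng.standard_normal())

# Hold-out split: first part for estimation, second part for validation
y_est, u_est = y[:400], u[:400]
y_val, u_val = y[400:], u[400:]

val_mse = {}
for n in range(1, 7):
    Phi_e, Y_e = arx_regressors(y_est, u_est, n)
    theta, *_ = np.linalg.lstsq(Phi_e, Y_e, rcond=None)
    Phi_v, Y_v = arx_regressors(y_val, u_val, n)
    val_mse[n] = np.mean((Y_v - Phi_v @ theta) ** 2)   # fresh-data score

n_cv = min(val_mse, key=val_mse.get)
print(n_cv, val_mse[n_cv])
```

In contrast to the fit on estimation data, the validation score does not automatically improve with the order, so it can be minimized directly without a complexity penalty.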


Chapter 3
Regularization of Linear Regression Models


Abstract Linear regression models are widely used in statistics, machine learning
and system identification. They make it possible to address many important problems, are easy to
fit and enjoy simple analytical properties. The simplest method to fit linear regression
models is least squares, whose systematic treatment is available in many textbooks,
e.g., [35, Chap. 4], [12]. Linear regression models can also be fitted in other
ways, and one class of methods that we will consider in this chapter is the so-called
regularized least squares. It is an extension of least squares which minimizes the
sum of the square loss function and a regularization term. The latter can take various
forms, leading to several variants which have been applied extensively in theory as
well as in practical applications. In this chapter, we will focus on these methods and
introduce their fundamentals. In the first part of the appendix to this chapter, we also
report some basic results of linear algebra useful for the reading.


3.1 Linear Regression 


Regression theory is concerned with modelling relationships among variables. It is 
used for predicting one dependent variable based on the information provided by one 
or more independent variables. In linear regression, the relationship among variables 
is given by linear functions. To illustrate this, we start from the function estimation 
problem because it is intuitive and easy to understand. 

The aim of function estimation is to reconstruct a function $g : \mathbb{R}^n \to \mathbb{R}$ with $n \in \mathbb{N}$
from a collection of N measured values of g(x) and x which we denote, respectively,
by $y_i$ and $x_i$ for i = 1, ..., N. For generic values of x, the estimate $\hat{g}$ should give a
good prediction $\hat{g}(x)$ of g(x). The variables x and g(x) are often called the input and
the output variable or simply the input and the output, respectively. The collection of
measured values of x and g(x), given by the pairs $\{x_i, y_i\}$, is called the data set or
also the training set. In practical applications, the measurement $y_i$ is often not precise
and subject to some disturbance, i.e., for a given input $x_i$ there is often a discrepancy
between $g(x_i)$ and its measured value $y_i$. To describe this phenomenon, it is natural
to introduce a disturbance variable $e \in \mathbb{R}$ and assume that, for any given $x \in \mathbb{R}^n$, the
measured value of g(x) is


$$y = g(x) + e. \tag{3.1}$$


Hence, y is the measured output and g(x) is the noise-free or true output. Accordingly,
the training data $\{x_i, y_i\}_{i=1}^{N}$ are collected as follows:


$$y_i = g(x_i) + e_i, \qquad i = 1, \ldots, N. \tag{3.2}$$


We are interested in linear regression models for estimation of g. For illustration, an 
example is now introduced. 


Example 3.1 (Polynomial regression) We consider $g : [0, 1] \to \mathbb{R}$ and assume that
this function is smooth. Then, g can be well approximated by polynomials of
a certain order. In this case, a linear regression model for the function estimation
problem takes the following form:


$$y_i = \theta_1 + \sum_{k=2}^{n} \theta_k x_i^{k-1} + e_i, \qquad i = 1, \ldots, N, \tag{3.3}$$

where $\theta_k \in \mathbb{R}$ for k = 1, ..., n. Defining


$$\phi(x_i) = [1\ \ x_i\ \ \ldots\ \ x_i^{n-1}]^T, \qquad \theta = [\theta_1\ \ \theta_2\ \ \ldots\ \ \theta_n]^T, \tag{3.4}$$


where, for a real-valued matrix A, the notation $A^T$ denotes its matrix transpose, we
rewrite (3.3) as


$$y_i = \phi(x_i)^T\theta + e_i, \qquad i = 1, \ldots, N, \tag{3.5}$$


obtaining a more compact expression.


Although (3.5) is derived from Example 3.1, it is the general linear regression
model studied in the theory of regression. For convenience, we remove the dependence
of $\phi(x_i)$ on $x_i$ and simply write $\phi(x_i)$ as $\phi_i$ when the context is clear. In
addition, all the vectors are column vectors. Then, model (3.5) becomes


$$y_i = \phi_i^T\theta + e_i, \quad i = 1, \ldots, N, \qquad y_i \in \mathbb{R},\ \phi_i \in \mathbb{R}^n,\ \theta \in \mathbb{R}^n,\ e_i \in \mathbb{R}. \tag{3.6}$$


In what follows, we will focus on (3.6) and introduce the linear regression problem,
the methods of least squares and regularized least squares. We will call $y_i \in \mathbb{R}$ the
measured output, $\phi_i \in \mathbb{R}^n$ the regressor, $\theta \in \mathbb{R}^n$ the model parameter, n the model
order, and $e_i$ the measurement noise.

Before proceeding, it should be noted that the choice of the model order n is a
critical problem in practical applications. The rule of thumb is to set n to a large
enough value such that g can be represented by the proposed model structure. In system
identification, this corresponds to introducing a model structure flexible enough
to contain the true system. Consider, e.g., Example 3.1 again and assume that the
function g is actually a polynomial of order 5. Clearly, if the dimension of $\theta$ does
not satisfy $n \geq 6$, then $x^5$ cannot be represented and some model bias will affect the
estimation process. However, the order n should not be chosen larger than necessary,
because this can increase the variance of the model estimate. This problem is actually
the same as model complexity selection in classical system identification and is
connected with the bias-variance trade-off illustrated in the first two chapters and
also discussed in more detail shortly.

Also in light of the above discussion, we often assume that the model order n is
either large enough for g to be adequately represented by the proposed model or even
that a true model parameter that has generated the data exists, denoted by $\theta_0 \in \mathbb{R}^n$.
Hence, we can formulate linear regression as the problem of obtaining an estimate
$\hat{\theta}$ such that, given a new regressor $\phi \in \mathbb{R}^n$, the prediction $\phi^T\hat{\theta}$ is close to $\phi^T\theta_0$.


3.2 The Least Squares Method 


There are many methods to estimate $\theta$ in the linear regression model (3.6). In this
section, we consider the least squares (LS) method.


3.2.1 Fundamentals of the Least Squares Method


Given the data $y_i$, $\phi_i$ for i = 1, ..., N, one way to estimate $\theta$ is to minimize the least
squares (LS) criterion:


$$\hat{\theta}^{\mathrm{LS}} = \arg\min_{\theta} l(\theta), \qquad l(\theta) = \sum_{i=1}^{N} (y_i - \phi_i^T\theta)^2, \tag{3.7}$$


where $l(\theta)$ is the LS criterion and $\hat{\theta}^{\mathrm{LS}}$ is the LS estimate of $\theta$. Then, the prediction
of $\phi^T\theta_0$ for a given $\phi \in \mathbb{R}^n$ is obtained as


$$\hat{y} = \phi^T\hat{\theta}^{\mathrm{LS}}. \tag{3.8}$$


3.2.1.1 Normal Equations and LS Estimate 


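The LS estimate $\hat{\theta}^{\mathrm{LS}}$ given by (3.7) has a closed-form expression. To see this, note
that the first- and second-order derivatives of $l(\theta)$ with respect to $\theta$ are

$$\frac{\partial l(\theta)}{\partial \theta} = 2\sum_{i=1}^{N} \phi_i\phi_i^T\theta - 2\sum_{i=1}^{N} \phi_i y_i, \qquad \frac{\partial^2 l(\theta)}{\partial \theta\, \partial \theta^T} = 2\sum_{i=1}^{N} \phi_i\phi_i^T \succeq 0, \tag{3.9}$$


where $A \succeq 0$ means that A is a positive semidefinite matrix. Then all $\hat{\theta}^{\mathrm{LS}}$ that satisfy


$$\left[\sum_{i=1}^{N} \phi_i\phi_i^T\right]\hat{\theta}^{\mathrm{LS}} = \sum_{i=1}^{N} \phi_i y_i \tag{3.10}$$


are global minima of $l(\theta)$. The set of Eqs. (3.10) is known as the normal equations.
For the time being, we assume that $\sum_{i=1}^{N} \phi_i\phi_i^T$ is full rank.$^1$ Then


$$\hat{\theta}^{\mathrm{LS}} = \left[\sum_{i=1}^{N} \phi_i\phi_i^T\right]^{-1}\sum_{i=1}^{N} \phi_i y_i. \tag{3.11}$$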


3.2.1.2 Matrix Formulation 


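It is often convenient to rewrite the LS method in matrix form. To this end, let


$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad \Phi = \begin{bmatrix} \phi_1^T \\ \phi_2^T \\ \vdots \\ \phi_N^T \end{bmatrix}, \qquad E = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}. \tag{3.12}$$


We can then rewrite (3.6) with the $\theta_0$ that generated the data, the LS criterion (3.7),
the normal Eqs. (3.10) and the LS estimate (3.11) in matrix form, respectively:


$$Y = \Phi\theta_0 + E \tag{3.13}$$
$$\hat{\theta}^{\mathrm{LS}} = \arg\min_{\theta} l(\theta), \qquad l(\theta) = \|Y - \Phi\theta\|_2^2 \tag{3.14}$$
$$\Phi^T\Phi\,\hat{\theta}^{\mathrm{LS}} = \Phi^T Y \tag{3.15}$$
$$\hat{\theta}^{\mathrm{LS}} = (\Phi^T\Phi)^{-1}\Phi^T Y, \tag{3.16}$$

where $\|\cdot\|_2$ is the Euclidean norm, i.e., the 2-norm, and $\Phi$ is called the regression
matrix.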
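
A minimal sketch of (3.12)-(3.16) for the polynomial regression of Example 3.1. The cubic test function, the noise level, and the variable names are assumptions for illustration. The estimate is computed once from the normal equations (3.15) and once with `np.linalg.lstsq`, which does not form $\Phi^T\Phi$ explicitly and is usually preferable when $\Phi$ is poorly conditioned.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data from a cubic polynomial on [0, 1] (n = 4 parameters)
theta0 = np.array([1.0, -2.0, 0.5, 3.0])
N, n = 50, 4
x = rng.uniform(0.0, 1.0, N)

# Regression matrix with rows phi_i^T = [1, x_i, ..., x_i^(n-1)], cf. (3.4)
Phi = np.vander(x, n, increasing=True)
Y = Phi @ theta0 + 0.1 * rng.standard_normal(N)      # data model (3.13)

# LS estimate from the normal equations (3.15); solve() avoids an explicit inverse
theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Same estimate from a solver that works on Phi directly
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

# At the minimum the residual is orthogonal to the columns of Phi
residual = Y - Phi @ theta_ls
print(theta_ne, np.abs(Phi.T @ residual).max())
```

The orthogonality of the residual to the regressors is just a restatement of the normal equations and gives a quick numerical check of any LS implementation.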


$^1$ Recall that the column rank (resp., the row rank) of a matrix is the dimension of the space spanned
by the columns (resp., the rows) of the matrix. It is a fundamental result in linear algebra that the
column rank and the row rank of a matrix are always equal and this number is called the rank of
the matrix. A matrix is said to be full rank if its rank is equal to the lesser of the number of rows
and columns and a matrix is said to be rank deficient otherwise.


3.2.2 Mean Squared Error and Model Order Selection


3.2.2.1 Bias, Variance, and Mean Squared Error of the LS Estimate 


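We study the linear regression problem in a probabilistic framework, assuming that
data are generated according to (3.13) and that


the measurement noises $e_i$, for i = 1, ..., N, are i.i.d. with mean 0 and variance $\sigma^2$. (3.17)


Due to this assumption, the LS estimator $\hat{\theta}^{\mathrm{LS}}$, as well as any estimator of $\theta$ dependent
on the data, becomes a random variable. Then, it is interesting to study the statistical
properties of $\hat{\theta}^{\mathrm{LS}}$, such as the bias, variance and mean squared error (MSE).

All the expectations reported below are computed with respect to the noises $e_i$,
with the regressors $\phi_i$ assumed to be deterministic. Simple calculations lead to


$$\mathscr{E}(\hat{\theta}^{\mathrm{LS}}) = \theta_0 \tag{3.18a}$$
$$\hat{\theta}^{\mathrm{LS}}_{\mathrm{bias}} = \mathscr{E}(\hat{\theta}^{\mathrm{LS}}) - \theta_0 = 0 \tag{3.18b}$$
$$\mathrm{Cov}(\hat{\theta}^{\mathrm{LS}}, \hat{\theta}^{\mathrm{LS}}) = \mathscr{E}\left[(\hat{\theta}^{\mathrm{LS}} - \mathscr{E}(\hat{\theta}^{\mathrm{LS}}))(\hat{\theta}^{\mathrm{LS}} - \mathscr{E}(\hat{\theta}^{\mathrm{LS}}))^T\right] = \sigma^2(\Phi^T\Phi)^{-1} \tag{3.18c}$$
$$\mathrm{MSE}(\hat{\theta}^{\mathrm{LS}}, \theta_0) = \mathscr{E}\left[(\hat{\theta}^{\mathrm{LS}} - \theta_0)(\hat{\theta}^{\mathrm{LS}} - \theta_0)^T\right] = \mathrm{Cov}(\hat{\theta}^{\mathrm{LS}}, \hat{\theta}^{\mathrm{LS}}) + \hat{\theta}^{\mathrm{LS}}_{\mathrm{bias}}(\hat{\theta}^{\mathrm{LS}}_{\mathrm{bias}})^T = \sigma^2(\Phi^T\Phi)^{-1}, \tag{3.18d}$$


where $\mathrm{Cov}(\hat{\theta}^{\mathrm{LS}}, \hat{\theta}^{\mathrm{LS}})$ is the covariance matrix of $\hat{\theta}^{\mathrm{LS}}$ and $\mathrm{MSE}(\hat{\theta}^{\mathrm{LS}}, \theta_0)$ is the MSE
matrix of $\hat{\theta}^{\mathrm{LS}}$ as a function of the true model parameter $\theta_0$.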


3.2.2.2 Model Order Selection 


The issue of model order selection is essentially the same as that of model complex- 
ity selection in the classical system identification scenario. Therefore, the techniques 
introduced in Sect. 2.4.3 can be used to choose the model order n, e.g., Akaike’s infor- 
mation criterion (AIC) [1], the Bayesian Information criterion (BIC) or Minimum 
Description Length (MDL) approach [25, 39]. 

The quality of the LS estimate $\hat{\theta}^{LS}$ depends on the adopted model order $n$.
In practical applications, model complexity is in general unknown and needs to
be determined from data. As the model order $n$ gets larger, the data fit
$\|Y - \Phi\hat{\theta}^{LS}\|_2^2$ in (3.14) will become smaller, but the variances along the diagonal of
the MSE matrix (3.18d) of $\hat{\theta}^{LS}$ will become larger at the same time. When assessing
the quality of $\hat{\theta}^{LS}$, one way to account for the increasing variance is to introduce
criteria that suitably modify the plain data fit. AIC and BIC are techniques following
this idea and can be used for model order selection. More specifically, besides (3.17),
further assuming that the errors are independent and Gaussian, i.e.,

$$e_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1, \ldots, N \tag{3.19}$$

with known noise variance $\sigma^2$, we obtain

$$\text{AIC:} \quad \hat{\theta}^{LS} = \arg\min_{\theta \in \mathbb{R}^n} \frac{1}{N}\|Y - \Phi\theta\|_2^2 + 2\sigma^2 \frac{n}{N}, \tag{3.20}$$

$$\text{BIC or MDL:} \quad \hat{\theta}^{LS} = \arg\min_{\theta \in \mathbb{R}^n} \frac{1}{N}\|Y - \Phi\theta\|_2^2 + \log(N)\,\sigma^2 \frac{n}{N}, \tag{3.21}$$

where the minimization also takes place over a family of model structures with
different dimension $n$ of $\theta$.

Another way is to estimate the prediction capability of the model on some unseen
data which are not used for model estimation. As briefly seen in Sect. 2.6.3, cross-validation
(CV) exploits this idea and is among the most widely used techniques for
model selection. Recall that hold out CV is the simplest form of CV with data divided
into two parts. One part is used to estimate the model with different model orders
and the other part is used to assess the prediction capability of each model through
the prediction score $\|Y_v - \Phi_v \hat{\theta}^{LS}\|_2^2$. Here, $Y_v$, $\Phi_v$ are the validation data which are
different from those used to derive $\hat{\theta}^{LS}$. The model order giving the best prediction
score will be chosen.

The noise variance $\sigma^2$ of the measurement noises $e_i$ plays an important role in
statistical modelling, e.g., in the assessment of the variance of $\hat{\theta}^{LS}$ and in the model
order selection using, e.g., AIC (3.20) or BIC (3.21). In practical applications, the
noise variance $\sigma^2$ is in general unknown and needs to be estimated from the data
$Y$ and $\Phi$. It can be estimated in different ways based on the maximum likelihood
estimation (MLE) method or the statistical properties of $\hat{\theta}^{LS}$.

Under (3.17) and the Gaussian assumption (3.19), the ML estimate of $\sigma^2$, as given
in [25, p. 506], is

$$\hat{\sigma}^{2,ML} = \frac{1}{N}\|Y - \Phi\hat{\theta}^{LS}\|_2^2. \tag{3.22}$$

Using only assumption (3.17), an unbiased estimator of $\sigma^2$, as given in [25, p. 554],
turns out to be

$$\hat{\sigma}^{2} = \frac{1}{N - n}\|Y - \Phi\hat{\theta}^{LS}\|_2^2. \tag{3.23}$$

AIC and BIC were reported, respectively, in (3.20) and (3.21) assuming known noise
variance. When $\sigma^2$ is unknown, the use of the ML estimate (3.22) leads to the widely
used AIC and BIC for Gaussian innovations, e.g., [25, pp. 506-507]:

$$\text{AIC:} \quad \hat{\theta}^{LS} = \arg\min_{\theta \in \mathbb{R}^n} \log\left(\frac{1}{N}\|Y - \Phi\theta\|_2^2\right) + 2\frac{n}{N}, \tag{3.24}$$

$$\text{BIC or MDL:} \quad \hat{\theta}^{LS} = \arg\min_{\theta \in \mathbb{R}^n} \log\left(\frac{1}{N}\|Y - \Phi\theta\|_2^2\right) + \log(N)\frac{n}{N}. \tag{3.25}$$

Fig. 3.1 Polynomial regression: the function $g(x)$ (blue curve) and the data $\{x_i, y_i\}_{i=1}^{40}$ (red circles)
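The quantities above are cheap to evaluate once the residual sum of squares of an order-$n$ LS fit is available. The following sketch, with hypothetical function names, evaluates the noise variance estimates (3.22) and (3.23) and the AIC/BIC scores in both the known-variance form (3.20), (3.21) and the unknown-variance form (3.24), (3.25).

```python
import numpy as np

def noise_variance_estimates(Phi, Y):
    """Return the ML estimate (3.22) and the unbiased estimate (3.23) of sigma^2."""
    N, n = Phi.shape
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    rss = float(np.sum((Y - Phi @ theta) ** 2))   # ||Y - Phi theta_LS||^2
    return rss / N, rss / (N - n)

def aic_bic_known_variance(rss, N, n, sigma2):
    """Scores (3.20) and (3.21), evaluated at the LS estimate of an order-n model."""
    aic = rss / N + 2.0 * sigma2 * n / N
    bic = rss / N + np.log(N) * sigma2 * n / N
    return aic, bic

def aic_bic_unknown_variance(rss, N, n):
    """Scores (3.24) and (3.25), where sigma^2 is replaced by its ML estimate."""
    aic = np.log(rss / N) + 2.0 * n / N
    bic = np.log(rss / N) + np.log(N) * n / N
    return aic, bic

# Tiny illustration: constant model fitted to four points
Phi_demo = np.ones((4, 1))
Y_demo = np.array([1.0, 2.0, 3.0, 4.0])
s2_ml, s2_unbiased = noise_variance_estimates(Phi_demo, Y_demo)
aic_demo, bic_demo = aic_bic_unknown_variance(rss=0.04, N=40, n=3)
```

Model order selection then amounts to computing these scores for each candidate $n$ and keeping the order with the smallest value. Since $\log(N) > 2$ for $N \geq 8$, BIC penalizes each additional parameter more heavily than AIC.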


Example 3.2 (Polynomial regression using LS and discrete model order selection)
We apply the LS method and the model order selection techniques to polynomial
regression as sketched in Example 3.1. Let the function $g$ be

$$g(x) = \sin^2(x)(1 - x^2), \quad x \in [0, 1]. \tag{3.26}$$

Then, we generate the data as follows:

$$y_i = \sin^2(x_i)(1 - x_i^2) + e_i, \quad i = 1, \ldots, 40, \tag{3.27}$$

where $x_1 = 0$, $x_{40} = 1$, the $x_2, \ldots, x_{39}$ are evenly spaced points between $x_1$ and
$x_{40}$, and the noises $e_i$ are i.i.d. Gaussian distributed with zero mean and standard
deviation 0.034. The function $g$ and the generated data are shown in Fig. 3.1.

The function $g$ is smooth and can be well approximated by polynomials. However,
it is unclear which order should be chosen. Hence, we test the values $n = 1, \ldots, 15$
and, for each order $n$, we form the regressor (3.4) and the linear regression model (3.13),
and derive the LS estimate $\hat{\theta}^{LS}$. As shown in Fig. 3.2, as the order $n$ increases the
data fit $\|Y - \Phi\hat{\theta}^{LS}\|_2^2$ keeps decreasing.

Fig. 3.2 Polynomial regression: profile of the LS data fit as a function of the discrete model
order n

For model order selection, we use AIC (3.24), BIC (3.25) and hold out CV with
$x_i, y_i$, $i = 1, 3, \ldots, 39$ for estimation and $x_i, y_i$, $i = 2, 4, \ldots, 40$ for validation.
Figure 3.3 plots the values of AIC (3.24), BIC (3.25) and the prediction score of
hold out CV. The orders $n$ selected by AIC and BIC are the same and equal to 3, while
that selected by hold out CV is 7.

To evaluate the performance of models of different complexity, we compute the
fit measure

$$\mathcal{F} = 100\left(1 - \left[\frac{\sum_{i=1}^{40} |g(x_i) - \hat{g}(x_i)|^2}{\sum_{i=1}^{40} |g(x_i) - \bar{g}|^2}\right]^{1/2}\right), \quad \bar{g} = \frac{1}{40}\sum_{i=1}^{40} g(x_i). \tag{3.28}$$

Note that $\mathcal{F} = 100$ means a perfect agreement between $g(x)$ and the corresponding
estimate. The model fits for $n = 1, \ldots, 15$ are shown in Fig. 3.4: the order $n = 3$ gives
the best prediction. Figure 3.5 plots the estimates of $g(x)$ for $n = 3, 7, 15$ over the
$x_i$, $i = 1, \ldots, 40$. Overfitting occurs when $n = 15$, indicating that the corresponding
model is too flexible and fooled by the noise.

Fig. 3.3 Polynomial regression: model order selection with n = 1, ..., 15 using LS. The blue
curve, the red curve and the yellow curve show the values of AIC (3.24), BIC (3.25) and the
prediction score of hold out CV, respectively

Fig. 3.4 Polynomial regression: profile of the model fit (3.28) as a function of the order n using
LS. The most accurate estimate is obtained with model order equal to 3 which corresponds to a
second-order polynomial

Fig. 3.5 Polynomial regression: true function (blue line) and LS estimates obtained using three
different model orders given by n = 3, 7 and 15
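The experiment of Example 3.2 can be imitated with a few lines of NumPy. The sketch below is not the authors' code: it assumes that the regressor (3.4) is the monomial vector $[1, x, \ldots, x^{n-1}]^T$, it uses an arbitrary random seed, and the noise realization therefore differs from the one behind Figs. 3.1 to 3.5. The selected orders need not coincide with those reported in the example.

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed (assumption)
N = 40
x = np.linspace(0.0, 1.0, N)
g = np.sin(x) ** 2 * (1.0 - x ** 2)         # g(x) as in (3.26)
y = g + 0.034 * rng.standard_normal(N)      # data as in (3.27)

def regression_matrix(x, n):
    """Rows [1, x_i, ..., x_i^(n-1)]; assumed form of the regressor (3.4)."""
    return np.vander(x, n, increasing=True)

def ls_fit(Phi, Y):
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return theta

# Hold out split: 1-based indices 1,3,...,39 for estimation, 2,4,...,40 for validation
est, val = slice(0, N, 2), slice(1, N, 2)

orders = range(1, 16)
data_fit, aic, bic, cv = [], [], [], []
for n in orders:
    Phi = regression_matrix(x, n)
    theta = ls_fit(Phi, y)
    rss = float(np.sum((y - Phi @ theta) ** 2))
    data_fit.append(rss)                                  # LS data fit
    aic.append(np.log(rss / N) + 2 * n / N)               # (3.24)
    bic.append(np.log(rss / N) + np.log(N) * n / N)       # (3.25)
    theta_est = ls_fit(Phi[est], y[est])                  # fit on estimation part only
    cv.append(float(np.sum((y[val] - Phi[val] @ theta_est) ** 2)))

n_aic = orders[int(np.argmin(aic))]
n_bic = orders[int(np.argmin(bic))]
n_cv = orders[int(np.argmin(cv))]
```

Because the candidate models are nested, `data_fit` is nonincreasing in $n$, which is why the plain data fit cannot be used for order selection on its own.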


3.3 Ill-Conditioning 
3.3.1 Ill-Conditioned Least Squares Problems 


When $\Phi \in \mathbb{R}^{N \times n}$ with $N \geq n$ is rank deficient, i.e., $\mathrm{rank}(\Phi) < n$, or "close" to
rank deficient, the corresponding LS problem is said to be ill-conditioned. Examples
were already encountered in Sect. 1.1.2 to discuss some limitations of the James-Stein
estimators and in Sect. 1.2 in the context of FIR models. There are different
ways to handle ill-conditioned LS problems. Below, we show how to calculate $\hat{\theta}^{LS}$
more accurately by using the singular value decomposition (SVD).


3.3.1.1 Singular Value Decomposition


SVD is a fundamental matrix decomposition. Any matrix $\Phi \in \mathbb{R}^{N \times n}$, with $N \geq n$
to simplify the exposition, can be decomposed as follows:

$$\Phi = U \Lambda V^T, \tag{3.29}$$

where $\Lambda$ is a rectangular diagonal matrix with nonnegative diagonal entries $\sigma_i$,
$i = 1, \ldots, n$, and $U \in \mathbb{R}^{N \times N}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, i.e., such that
$U^T U = U U^T = I_N$ and $V^T V = V V^T = I_n$. The factorization (3.29) is called the
singular value decomposition of $\Phi$ and the $\sigma_i$ are called the singular values of $\Phi$.
Without loss of generality, they can be assumed to be ordered according to their
magnitude:

$$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0.$$

Since $\Phi^T \Phi = V \Lambda^T \Lambda V^T = V D^2 V^T$, where $D$ is a square diagonal matrix whose
diagonal entries are the $\sigma_i$, it follows that

$$\sigma_i = \sqrt{\lambda_i(\Phi^T \Phi)}, \quad i = 1, \ldots, n, \tag{3.30}$$

where $\lambda_i(A)$ denotes the $i$th eigenvalue of the matrix $A$.


3.3.1.2 Condition Number 


The condition number of a matrix is a measure of how "close" the matrix is to being rank
deficient. When $\Phi$ is an invertible square matrix, it is denoted by $\mathrm{cond}(\Phi)$ below
and defined as

$$\mathrm{cond}(\Phi) = \|\Phi\| \, \|\Phi^{-1}\|, \tag{3.31}$$

where $\|\cdot\|$ is a matrix norm, with the convention that $\mathrm{cond}(\Phi) = \infty$ for singular $\Phi$.
For a generic $\Phi \in \mathbb{R}^{N \times n}$, with SVD in the form (3.29), its condition number with
respect to the 2-norm $\|\cdot\|_2$ is defined as

$$\mathrm{cond}(\Phi) = \frac{\sigma_{\max}}{\sigma_{\min}}, \tag{3.32}$$

where $\sigma_{\max} = \sigma_1$ and $\sigma_{\min} = \sigma_n$ are the largest and smallest singular values of $\Phi$,
respectively. If we use the 2-norm $\|\cdot\|_2$ in (3.31), then (3.31) coincides with (3.32).
Hereafter, the condition number of a matrix will be defined by (3.32).
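The relation (3.30) between singular values and eigenvalues, and the definition (3.32) of the condition number, can be checked numerically. The matrix below is an arbitrary illustrative choice.

```python
import numpy as np

# Arbitrary 3x2 full-rank matrix used only for illustration
Phi = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])

s = np.linalg.svd(Phi, compute_uv=False)          # singular values, descending order
eigs = np.linalg.eigvalsh(Phi.T @ Phi)[::-1]      # eigenvalues of Phi^T Phi, descending

cond_from_svd = s[0] / s[-1]                      # definition (3.32)
cond_numpy = np.linalg.cond(Phi)                  # NumPy's default is the 2-norm condition number
```

Here `s` coincides with the square roots of `eigs`, and the two condition numbers agree.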


3.3.1.3 Ill-Conditioned Matrix and LS Problem 

The condition number of a matrix is important since it can be used to measure the
sensitivity of the LS estimate to perturbations in the data. To be specific, let $\Phi \in \mathbb{R}^{N \times n}$
be full rank and let $\delta Y$ denote a small componentwise perturbation in $Y$. The solution
of the perturbed LS criterion becomes

$$\hat{\theta}^{LS}_{\delta} = \arg\min_{\theta} \|(Y + \delta Y) - \Phi\theta\|_2^2. \tag{3.33}$$

Then, it can be shown, e.g., [17, Chap. 5], [10, Chap. 3], that

$$\frac{\|\hat{\theta}^{LS} - \hat{\theta}^{LS}_{\delta}\|_2}{\|\hat{\theta}^{LS}\|_2} \leq \mathrm{cond}(\Phi)\,\varepsilon + O(\varepsilon^2), \qquad \varepsilon = \frac{\|\delta Y\|_2}{\|Y\|_2}. \tag{3.34}$$

So, the relative error bound depends on $\mathrm{cond}(\Phi)$: the larger $\mathrm{cond}(\Phi)$, the larger the
relative error. One can thus say that the matrix $\Phi$ (and the LS problem) with a small
condition number is well conditioned, while the matrix $\Phi$ (and the LS problem) with
a large condition number is ill-conditioned. The condition number also enters more
complex bounds on the relative error due to perturbations on the matrix $\Phi$ [10, 17].


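Example 3.3 (Effect of ill-conditioning on LS) Consider the linear regression model
(3.13). Let

$$\Phi = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 + 10^{-8} & 1 - 10^{-8} \end{bmatrix}, \quad Y = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \tag{3.35}$$

The two singular values of $\Phi$ are $\sigma_{\max} = 1$ and $\sigma_{\min} = 5 \times 10^{-9}$, implying that
$\mathrm{cond}(\Phi) = 2 \times 10^{8}$. Thus, $\Phi$ and the LS problem (3.14) are ill-conditioned.
Using the normal Eq. (3.15), we obtain the LS estimate $\hat{\theta}^{LS}$ in closed form:

$$\hat{\theta}^{LS} = (\Phi^T \Phi)^{-1} \Phi^T Y = \Phi^{-1} Y = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \tag{3.36}$$

Now, suppose that there is a small perturbation $\delta Y$ in $Y$

$$\delta Y = \begin{bmatrix} 0.01 \\ 0 \end{bmatrix}. \tag{3.37}$$

Solving the normal Eq. (3.15) with $Y$ replaced by $Y + \delta Y$ now gives

$$\hat{\theta}^{LS}_{\delta} = \begin{bmatrix} 1.01 - 10^{6} \\ 1.01 + 10^{6} \end{bmatrix}. \tag{3.38}$$

So, when the LS problem (3.14) is ill-conditioned, a small perturbation in $Y$ could
cause a significant change in the LS estimate derived by solving the normal Eq. (3.15)
directly.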
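The effect described in Example 3.3 is easy to reproduce in floating point arithmetic. The sketch below assumes that $\Phi$, $Y$ and $\delta Y$ are as read from (3.35) and (3.37). Since $\Phi$ is square and invertible, the normal equations reduce to $\Phi\theta = Y$; the code solves this system directly rather than forming $\Phi^T\Phi$, whose condition number (about $4 \times 10^{16}$) cannot be resolved in double precision.

```python
import numpy as np

eps = 1e-8
Phi = 0.5 * np.array([[1.0, 1.0],
                      [1.0 + eps, 1.0 - eps]])
Y = np.array([1.0, 1.0])
dY = np.array([0.01, 0.0])                 # small perturbation of the output

s = np.linalg.svd(Phi, compute_uv=False)
cond_Phi = s[0] / s[-1]                    # close to 2e8

theta = np.linalg.solve(Phi, Y)            # unperturbed estimate, close to [1, 1]
theta_pert = np.linalg.solve(Phi, Y + dY)  # perturbed estimate, entries of order 1e6

relative_change = np.linalg.norm(theta - theta_pert) / np.linalg.norm(theta)
```

A perturbation of relative size below $10^{-2}$ in $Y$ thus moves the estimate by a relative amount of order $10^{6}$, consistent with the bound (3.34) for a condition number of about $2 \times 10^{8}$.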


Example 3.4 (Polynomial regression: ill-conditioned LS Problem) We revisit the
polynomial regression example defined by (3.26) and (3.27), stressing the dependence of the
condition number on the polynomial complexity. In particular, Fig. 3.6 shows that
the ill-conditioning of the regression matrix $\Phi$ constructed according to (3.4) and
(3.12) becomes more severe as the dimension $n$ increases. This further points out the importance
of a careful selection of the discrete model order to control the estimator's variance
when using LS.

Fig. 3.6 Polynomial regression: profile of the base 10 logarithm of the condition number of $\Phi$ as
a function of the order n


3.3.1.4 LS Estimate Exploiting the SVD of ® 


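In order to obtain more accurate LS estimates for ill-conditioned problems, one can
use the SVD of $\Phi$. Given $\Phi \in \mathbb{R}^{N \times n}$ with $N \geq n$, we consider two cases:

• $\Phi$ is rank deficient, i.e., $\mathrm{rank}(\Phi) < n$.
• $\Phi$ is full rank but has a very large condition number, i.e., $\mathrm{rank}(\Phi) = n$ but $\mathrm{cond}(\Phi)$
is very large.

For the rank-deficient case, we assume without loss of generality that $\mathrm{rank}(\Phi) =
m < n$. In this case, the LS problem does not have a unique solution. To get a special
solution, we have to impose extra conditions on the solutions of the LS problem.
Let the singular value decomposition of $\Phi$ be

$$\Phi = U \Lambda V^T = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1 & V_2 \end{bmatrix}^T, \tag{3.39}$$

where $\Lambda_1 \in \mathbb{R}^{m \times m}$ is diagonal and positive definite while $U_1 \in \mathbb{R}^{N \times m}$ and $V_1 \in
\mathbb{R}^{n \times m}$.
We now perform a change of coordinates in both the output and parameter space

$$\tilde{Y} = U^T Y = \begin{bmatrix} U_1^T Y \\ U_2^T Y \end{bmatrix} = \begin{bmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \end{bmatrix}, \qquad \tilde{\theta} = V^T \theta = \begin{bmatrix} V_1^T \theta \\ V_2^T \theta \end{bmatrix} = \begin{bmatrix} \tilde{\theta}_1 \\ \tilde{\theta}_2 \end{bmatrix}.$$

Note that both $\tilde{Y}_1$ and $\tilde{\theta}_1$ are $m$-dimensional vectors. In the new coordinates, the
residual vector is

$$U^T (Y - \Phi\theta) = \tilde{Y} - \Lambda\tilde{\theta} = \begin{bmatrix} \tilde{Y}_1 - \Lambda_1 \tilde{\theta}_1 \\ \tilde{Y}_2 \end{bmatrix}.$$

The LS criterion can be rewritten as

$$\|Y - \Phi\theta\|^2 = (Y - \Phi\theta)^T U U^T (Y - \Phi\theta) = \|\tilde{Y} - \Lambda\tilde{\theta}\|^2 = \|\tilde{Y}_1 - \Lambda_1 \tilde{\theta}_1\|^2 + \|\tilde{Y}_2\|^2$$

and is minimized by

$$\tilde{\theta}^{LS} = \begin{bmatrix} \tilde{\theta}_1^{LS} \\ \tilde{\theta}_2^{LS} \end{bmatrix} = \begin{bmatrix} \Lambda_1^{-1} \tilde{Y}_1 \\ \tilde{\theta}_2 \end{bmatrix}, \tag{3.40}$$

where $\tilde{\theta}_2 \in \mathbb{R}^{n-m}$ is an arbitrary vector. To get the minimum norm solution, one can
set $\tilde{\theta}_2 = 0$ which, turning back to the original coordinates, yields

$$\hat{\theta}^{LS} = V \tilde{\theta}^{LS} = V_1 \Lambda_1^{-1} U_1^T Y. \tag{3.41}$$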


Interestingly, for the rank-deficient case, the special solution (3.41) relates to the
Moore-Penrose pseudoinverse of $\Phi$, defined as

$$\Phi^{+} = V \Lambda^{+} U^T = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1^{-1} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} U_1 & U_2 \end{bmatrix}^T = V_1 \Lambda_1^{-1} U_1^T.$$

So, given the rectangular diagonal matrix $\Lambda$, its pseudoinverse $\Lambda^{+}$ is obtained by replacing all the nonzero
diagonal entries by their reciprocal and transposing the resulting matrix. When
$\mathrm{rank}(\Phi) = n$, the pseudoinverse returns the usual (unique) LS solution

$$\Phi^{+} = (\Phi^T \Phi)^{-1} \Phi^T.$$

It follows that the minimum norm solution among the general solutions of the LS
problem (3.14) can always be written as

$$\hat{\theta}^{LS} = \Phi^{+} Y.$$


For the rank-deficient case, due to roundoff errors, $\Phi$ may have some very small
computed singular values other than the $m$ singular values contained in $\Lambda_1$ in (3.39).
The situation is similar to the case where $\Phi$ is full rank but with a very large condition
number. Note also that the rank of $\Phi$ needs to be known beforehand to compute the
SVD of $\Phi$. However, numerical determination of the rank of a matrix is nontrivial
(and out of scope of this book). Here, we just mention a simple way to deal with
these issues by using the so-called truncated SVD.
Consider the SVD (3.39) and, without loss of generality, assume

$$\Lambda = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n) \quad \text{with} \quad \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0.$$

Now set $\hat{\sigma}_i = \sigma_i$ if $\sigma_i > tol$ and $\hat{\sigma}_i = 0$ otherwise. Then

$$\hat{\Phi} = U \hat{\Lambda} V^T, \tag{3.42}$$

where $\hat{\Lambda} \in \mathbb{R}^{N \times n}$ is diagonal with entries $\hat{\sigma}_1, \hat{\sigma}_2, \ldots, \hat{\sigma}_n$, is called the truncated SVD
of $\Phi$. So, the truncated SVD (3.42) can be used to handle the case where $\Phi$ has full
rank but large condition number: for a given $tol$, it suffices to replace $\Phi$ with $\hat{\Phi}$ and
then to compute the LS estimate of $\theta$ by means of $\hat{\Phi}^{+} Y$.


Example 3.5 (Truncated SVD) We revisit Example 3.3 by making use of the truncated
SVD of $\Phi$. We take the user-supplied measure of uncertainty $tol$ to be 1e-7.
Then the LS estimate $\hat{\theta}^{LS}$ computed by (3.41) with $Y$ replaced by $Y + \delta Y$ becomes


(3.43) 


1.0049 


One can thus see that the estimate is now very close to $[1 \;\; 1]^T$, which was the one
obtained in absence of the perturbation $\delta Y$.
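A truncated-SVD estimate along the lines of (3.41) and (3.42) can be sketched as follows. The function name is hypothetical, and $\Phi$, $Y$ and $\delta Y$ are assumed to be those of Example 3.3 (a 2 x 2 matrix with rows $[1, 1]/2$ and $[1 + 10^{-8}, 1 - 10^{-8}]/2$, $Y = [1, 1]^T$ and $\delta Y = [0.01, 0]^T$).

```python
import numpy as np

def truncated_svd_ls(Phi, Y, tol):
    """Minimum norm LS estimate that keeps only singular values larger than tol."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    keep = s > tol
    # theta = V_1 Lambda_1^{-1} U_1^T Y, restricted to the retained singular triplets
    return Vt[keep].T @ ((U[:, keep].T @ Y) / s[keep])

eps = 1e-8
Phi = 0.5 * np.array([[1.0, 1.0],
                      [1.0 + eps, 1.0 - eps]])
Y = np.array([1.0, 1.0])
dY = np.array([0.01, 0.0])

theta_tsvd = truncated_svd_ls(Phi, Y + dY, tol=1e-7)

# Cross-check with NumPy's pseudoinverse; its cutoff is relative to the largest
# singular value, which here is close to 1, so the same singular value is discarded.
theta_pinv = np.linalg.pinv(Phi, 1e-7) @ (Y + dY)
```

With the small singular value discarded, the perturbed estimate stays within about $10^{-2}$ of $[1 \;\; 1]^T$ instead of jumping to entries of order $10^{6}$. The price is a bias, since the direction associated with the discarded singular value is no longer estimated.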


3.3.2  Ill-Conditioning in System Identification 


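In Sect. 1.2 we have illustrated an ill-conditioned system identification problem.
Below, we will see that the difficulty was due to the fact that low-pass filtered inputs
may induce regression matrices with large $\mathrm{cond}(\Phi)$.

Consider the FIR model of order $n$:

$$y(t) = \sum_{k=1}^{n} g_k u(t - k) + e(t), \quad t = 1, \ldots, N, \tag{3.44}$$

which can be written in the form (3.13) as follows:

$$\underbrace{\begin{bmatrix} y(1) \\ y(2) \\ \vdots \\ y(N) \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} u(0) & u(-1) & \cdots & u(1-n) \\ u(1) & u(0) & \cdots & u(2-n) \\ \vdots & \vdots & \ddots & \vdots \\ u(N-1) & u(N-2) & \cdots & u(N-n) \end{bmatrix}}_{\Phi} \underbrace{\begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_n \end{bmatrix}}_{\theta} + \underbrace{\begin{bmatrix} e(1) \\ e(2) \\ \vdots \\ e(N) \end{bmatrix}}_{E}. \tag{3.45}$$

Then we have

$$\Phi^T \Phi = \begin{bmatrix} \sum_{t=0}^{N-1} u(t)^2 & \sum_{t=0}^{N-1} u(t)u(t-1) & \cdots & \sum_{t=0}^{N-1} u(t)u(t-n+1) \\ \sum_{t=0}^{N-1} u(t)u(t-1) & \sum_{t=-1}^{N-2} u(t)^2 & \cdots & \sum_{t=-1}^{N-2} u(t)u(t-n+2) \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{t=0}^{N-1} u(t)u(t-n+1) & \sum_{t=-1}^{N-2} u(t)u(t-n+2) & \cdots & \sum_{t=1-n}^{N-n} u(t)^2 \end{bmatrix}. \tag{3.46}$$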


Since $\mathrm{cond}(\Phi^T \Phi) = (\mathrm{cond}(\Phi))^2$, we study $\mathrm{cond}(\Phi^T \Phi)$ in what follows. In addition,
while so far we have assumed deterministic regressors, now we work in a more
structured probabilistic framework where the system input is a stochastic process.
This implies that $\Phi$ is a random matrix. In particular, $u(t)$ is filtered white noise,
with the filter assumed to be stable and given by

$$H(q) = \sum_{k=0}^{\infty} h(k) q^{-k}. \tag{3.47a}$$

Hence,

$$u(t) = \sum_{k=0}^{\infty} h(k) v(t - k) = H(q) v(t), \tag{3.47b}$$

where $v(t)$ is zero-mean white noise of variance $\sigma^2$ with bounded fourth moments. It
follows that $u(t)$ is a zero-mean stationary stochastic process with covariance function
$k_u(t, s) = \mathcal{E}[u(t)u(s)] = R_u(t - s)$ with $R_u(\tau)$ defined as follows:

$$\mathcal{E}[u(t)u(t - \tau)] = \sum_{k=0}^{\infty} \sum_{l=0}^{\infty} h(k)h(l)\, \mathcal{E}[v(t - k)v(t - \tau - l)] = \sigma^2 \sum_{k=\tau}^{\infty} h(k)h(k - \tau) \triangleq R_u(\tau).$$

From the ergodic theory, e.g., [25, Theorem 3.4], it also follows that

$$\frac{1}{N}\sum_{t=1}^{N} u(t)u(t - \tau) \to R_u(\tau), \quad N \to \infty, \quad \text{a.s.} \tag{3.48}$$

From (3.46) and (3.48), one obtains the following almost sure convergence:

$$\frac{1}{N}\Phi^T \Phi \to \begin{bmatrix} R_u(0) & R_u(1) & \cdots & R_u(n-1) \\ R_u(1) & R_u(0) & \cdots & R_u(n-2) \\ \vdots & \vdots & \ddots & \vdots \\ R_u(n-1) & R_u(n-2) & \cdots & R_u(0) \end{bmatrix}, \quad N \to \infty, \quad \text{a.s.} \tag{3.49}$$

So, $\lim_{N \to \infty} \frac{1}{N}\Phi^T \Phi$ is the covariance matrix of $[u(1) \; \ldots \; u(n)]^T$, whose condition
number thus provides insights on the ill-conditioning affecting the system identification
problem.

Since the covariance matrix is real and symmetric, its condition number is the
ratio between the largest and the smallest of its eigenvalues. An important result of
O. Toeplitz, e.g., [44], [20, Chap. 5], says that as $n \to \infty$, the eigenvalues of the
covariance matrix of the infinite-dimensional vector $[u(1) \; u(2) \; \ldots]$ coincide with
the set of values assumed by the power spectrum of $u(t)$, which is given by

$$\Psi_u(\omega) = \sum_{\tau=-\infty}^{+\infty} R_u(\tau) e^{-i\omega\tau}. \tag{3.50}$$

Hence, considering also that $\Psi_u(-\omega) = \Psi_u(\omega)$, one has

$$\mathrm{cond}\left(\lim_{n \to \infty} \lim_{N \to \infty} \frac{1}{N}\Phi^T \Phi\right) = \frac{\max_{\omega \in [0, \pi]} \Psi_u(\omega)}{\min_{\omega \in [0, \pi]} \Psi_u(\omega)}. \tag{3.51}$$

In addition, since $u(t)$ is a filtered white noise (3.47) and $H(q)$ is stable, one also
has [see, e.g., [25, p. 37] for details]:

$$\Psi_u(\omega) = \sigma^2 |H(e^{i\omega})|^2, \tag{3.52}$$

where $H(e^{i\omega})$ is the frequency function of the filter $H(q)$, i.e.,

$$H(e^{i\omega}) = \sum_{k=0}^{\infty} h(k) e^{-i\omega k}. \tag{3.53}$$

Finally, combining the results (3.49)-(3.53) yields

$$\mathrm{cond}\left(\lim_{n \to \infty} \lim_{N \to \infty} \frac{1}{N}\Phi^T \Phi\right) = \frac{\max_{\omega \in [0, \pi]} |H(e^{i\omega})|^2}{\min_{\omega \in [0, \pi]} |H(e^{i\omega})|^2}. \tag{3.54}$$

When the maximum of $|H(e^{i\omega})|$ is significantly larger than the minimum of $|H(e^{i\omega})|$,
the matrix $\lim_{n \to \infty} \lim_{N \to \infty} \frac{1}{N}\Phi^T \Phi$ could be very ill-conditioned. For instance, if
we consider the stable filter

$$H(q) = \frac{1}{(1 - a q^{-1})^2}, \quad 0 \leq a < 1, \tag{3.55}$$

then one has

$$\frac{\max_{\omega \in [0, \pi]} |H(e^{i\omega})|^2}{\min_{\omega \in [0, \pi]} |H(e^{i\omega})|^2} = \frac{(1 + a)^4}{(1 - a)^4}. \tag{3.56}$$

As $a$ varies from 0.01 to 0.99, input power becomes more concentrated at low frequencies
and the ill-conditioning affecting the system identification problem worsens. In
fact, the above quantity increases from about 1 to $1.6 \times 10^{9}$.
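The growth of the ratio in (3.56) can be checked numerically. The sketch below evaluates $|H(e^{i\omega})|^2$ for the filter (3.55) on a frequency grid over $[0, \pi]$ and compares the resulting max/min ratio with the closed form $(1+a)^4/(1-a)^4$. The grid size and the function names are illustrative choices.

```python
import numpy as np

def spectrum_ratio(a, n_grid=2001):
    """max/min of |H(e^{i w})|^2 over [0, pi] for H(q) = 1 / (1 - a q^{-1})^2."""
    w = np.linspace(0.0, np.pi, n_grid)           # grid includes w = 0 and w = pi
    H2 = 1.0 / np.abs(1.0 - a * np.exp(-1j * w)) ** 4
    return H2.max() / H2.min()

def closed_form_ratio(a):
    """Right-hand side of (3.56)."""
    return (1.0 + a) ** 4 / (1.0 - a) ** 4

ratios = {a: closed_form_ratio(a) for a in (0.01, 0.5, 0.9, 0.99)}
```

For $a = 0.01$ the ratio is about 1.08, while for $a = 0.99$ it equals $199^4 \approx 1.57 \times 10^{9}$: the maximum is attained at $\omega = 0$ and the minimum at $\omega = \pi$, so the more low-pass the input, the worse the conditioning.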


3.4 Regularized Least Squares with Quadratic Penalties 


One way to handle ill-conditioning is to use regularized least squares (ReLS). This
method will play a special role in this book as a way to control overfitting by encoding prior
knowledge. First insights on these aspects are provided below.

ReLS adds a regularization term $J(\theta)$ into the LS criterion (3.14), yielding the
following problem:

$$\hat{\theta}^{R} = \arg\min_{\theta} \|Y - \Phi\theta\|_2^2 + \gamma J(\theta), \tag{3.57}$$

where $\gamma \geq 0$ is often called the regularization parameter. It has to balance the adherence
to the data $\|Y - \Phi\theta\|_2^2$ and the penalty $J(\theta)$. There are many choices for the
regularization term which can be connected with the prior knowledge on the true
model parameter $\theta_0$ that needs to be estimated.

In this section, we consider regularization terms $J(\theta)$ which are quadratic functions
of $\theta$. The resulting estimator will be denoted by ReLS-Q in this chapter. In
particular, we let $J(\theta) = \theta^T P^{-1} \theta$ so that the ReLS criterion (3.57) becomes

$$\hat{\theta}^{R} = \arg\min_{\theta} \|Y - \Phi\theta\|_2^2 + \gamma \theta^T P^{-1} \theta \tag{3.58a}$$
$$= (\Phi^T \Phi + \gamma P^{-1})^{-1} \Phi^T Y \tag{3.58b}$$
$$= P \Phi^T (\Phi P \Phi^T + \gamma I_N)^{-1} Y \tag{3.58c}$$
$$= (P \Phi^T \Phi + \gamma I_n)^{-1} P \Phi^T Y, \tag{3.58d}$$

where $P \in \mathbb{R}^{n \times n}$ is a positive semidefinite matrix, here assumed invertible, often
called the regularization matrix, and $I_n$ is the $n$-dimensional identity matrix.²
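The three closed forms (3.58b), (3.58c) and (3.58d) are algebraically equivalent when $P$ is invertible, which can be verified on random data. The dimensions, the seed and the way $P$ is generated below are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)             # arbitrary seed (assumption)
N, n, gamma = 8, 3, 0.7
Phi = rng.standard_normal((N, n))
Y = rng.standard_normal(N)
A = rng.standard_normal((n, n))
P = A @ A.T + np.eye(n)                    # symmetric positive definite, hence invertible

# (3.58b): requires the inverse of P and an n x n solve
theta_b = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)
# (3.58c): no inverse of P, N x N solve
theta_c = P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + gamma * np.eye(N), Y)
# (3.58d): no inverse of P, n x n solve
theta_d = np.linalg.solve(P @ Phi.T @ Phi + gamma * np.eye(n), P @ Phi.T @ Y)
```

Forms (3.58c) and (3.58d) never invert $P$, so they remain computable when $P$ is singular. Which of the two is cheaper depends on whether $N$ or $n$ is smaller.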


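Remark 3.1 The regularization matrix $P$ could be singular. In this case, (3.58a) is
not well defined but, with a suitable arrangement, we can use the Moore-Penrose
pseudoinverse $P^{+}$ instead of $P^{-1}$. In particular, let the SVD of $P$ be

$$P = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Lambda_P & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} U_1 & U_2 \end{bmatrix}^T,$$

where $\Lambda_P$ is a diagonal matrix with the positive singular values of $P$ as diagonal
elements and $U = [U_1 \; U_2]$ is an orthogonal matrix with $U_1$ having the same number
of columns as that of $\Lambda_P$. Recall also that $P^{+} = U_1 \Lambda_P^{-1} U_1^T$. In order to find how
(3.58a) should be modified for singular $P$, let us consider

$$P_\varepsilon = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Lambda_P & 0 \\ 0 & \varepsilon I \end{bmatrix} \begin{bmatrix} U_1 & U_2 \end{bmatrix}^T, \quad \varepsilon > 0.$$

By replacing $P$ with $P_\varepsilon$ in (3.58a), we obtain

$$\hat{\theta}^{R}_{\varepsilon} = \arg\min_{\theta} \|Y - \Phi\theta\|_2^2 + \gamma \theta^T U_1 \Lambda_P^{-1} U_1^T \theta + \frac{\gamma}{\varepsilon} \theta^T U_2 U_2^T \theta. \tag{3.59}$$

If we let $\varepsilon \to 0$, it follows that the parameter vector must satisfy $U_2^T \theta = 0$. Therefore,
we may conveniently associate to a singular $P$ the modified regularization problem

$$\hat{\theta}^{R} = \arg\min_{\theta} \|Y - \Phi\theta\|_2^2 + \gamma \theta^T P^{+} \theta \tag{3.60a}$$
$$\text{subj. to} \quad U_2^T \theta = 0. \tag{3.60b}$$

If $P^{-1}$ is replaced by $P^{+}$, it is easy to verify that (3.58c) or (3.58d) is still the optimal
solution of (3.60). Instead, this does not hold for (3.58b). For convenience, we will
use (3.58a) in the sequel and refer to (3.60) for its rigorous meaning.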


3.4.1 Making an Ill-Conditioned LS Problem Well 
Conditioned 


The ReLS-Q can make the ill-conditioned LS problem well conditioned. Consider
ridge regression which, as discussed in Sect. 1.2, corresponds to setting $P = I_n$,
hence obtaining

$$\hat{\theta}^{R} = \arg\min_{\theta} \|Y - \Phi\theta\|_2^2 + \gamma \|\theta\|_2^2 \tag{3.61a}$$
$$= (\Phi^T \Phi + \gamma I_n)^{-1} \Phi^T Y. \tag{3.61b}$$

² The step from (3.58c) to (3.58d) follows from the matrix equality $A(I_j + BA)^{-1} = (I_k +
AB)^{-1} A$ which holds for every $A \in \mathbb{R}^{k \times j}$ and $B \in \mathbb{R}^{j \times k}$.

The parameter $\gamma$ directly affects the condition number of $(\Phi^T \Phi + \gamma I_n)$ whose
inverse defines the regularized estimate. In fact, the positive definite square matrix
$(\Phi^T \Phi + \gamma I_n)$ has eigenvalues (coincident with its singular values) equal to $\sigma_i^2 + \gamma$.
Therefore,

$$\mathrm{cond}(\Phi^T \Phi + \gamma I_n) = \frac{\sigma_1^2 + \gamma}{\sigma_n^2 + \gamma},$$

which can be adjusted by tuning the regularization parameter $\gamma$. This means that
regularization can make the LS problem well conditioned even when $\Phi$ is rank
deficient: if the smallest singular value is null one has

$$\mathrm{cond}(\Phi^T \Phi + \gamma I_n) = \frac{\sigma_1^2 + \gamma}{\gamma}.$$
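The formula for the condition number of the ridge-regularized matrix can be illustrated on a rank-deficient regression matrix. The matrix below is an arbitrary example with two identical columns, so that $\sigma_1^2 = 6$ and $\sigma_2 = 0$.

```python
import numpy as np

def ridge_cond(Phi, gamma):
    """cond(Phi^T Phi + gamma I_n) computed from the singular values of Phi (N >= n)."""
    s = np.linalg.svd(Phi, compute_uv=False)
    return (s[0] ** 2 + gamma) / (s[-1] ** 2 + gamma)

# Rank-deficient example: identical columns, Phi^T Phi has eigenvalues 6 and 0
Phi = np.ones((3, 2))
gamma = 0.5

cond_formula = ridge_cond(Phi, gamma)                               # (6 + 0.5) / 0.5
cond_direct = np.linalg.cond(Phi.T @ Phi + gamma * np.eye(2))       # direct computation
```

Without regularization the normal equations are singular here; with $\gamma = 0.5$ the condition number drops to 13, and it keeps decreasing towards 1 as $\gamma$ grows.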


3.4.1.1 Mean Squared Error 


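Simple calculations of expectations with respect to the errors $e_i$, with the regressors
$\phi_i$ assumed to be deterministic, lead to

$$\mathcal{E}(\hat{\theta}^{R}) = (\Phi^T \Phi + \gamma P^{-1})^{-1} \Phi^T \Phi\, \theta_0 \tag{3.62a}$$
$$\hat{\theta}^{R}_{\mathrm{bias}} = \mathcal{E}(\hat{\theta}^{R}) - \theta_0 = -(\Phi^T \Phi + \gamma P^{-1})^{-1} \gamma P^{-1} \theta_0 \tag{3.62b}$$
$$\mathrm{Cov}(\hat{\theta}^{R}, \hat{\theta}^{R}) = \mathcal{E}\big[(\hat{\theta}^{R} - \mathcal{E}(\hat{\theta}^{R}))(\hat{\theta}^{R} - \mathcal{E}(\hat{\theta}^{R}))^T\big] = (\Phi^T \Phi + \gamma P^{-1})^{-1} \sigma^2 \Phi^T \Phi\, (\Phi^T \Phi + \gamma P^{-1})^{-1} \tag{3.62c}$$
$$\mathrm{MSE}(\hat{\theta}^{R}, \theta_0) = \mathcal{E}\big[(\hat{\theta}^{R} - \theta_0)(\hat{\theta}^{R} - \theta_0)^T\big] = \mathrm{Cov}(\hat{\theta}^{R}, \hat{\theta}^{R}) + \hat{\theta}^{R}_{\mathrm{bias}} (\hat{\theta}^{R}_{\mathrm{bias}})^T$$
$$= (\Phi^T \Phi + \gamma P^{-1})^{-1} (\sigma^2 \Phi^T \Phi + \gamma^2 P^{-1} \theta_0 \theta_0^T P^{-1}) (\Phi^T \Phi + \gamma P^{-1})^{-1}, \tag{3.62d}$$

where $\mathrm{Cov}(\hat{\theta}^{R}, \hat{\theta}^{R})$ is the covariance matrix of $\hat{\theta}^{R}$ and $\mathrm{MSE}(\hat{\theta}^{R}, \theta_0)$ is the MSE
matrix of $\hat{\theta}^{R}$, a function of the true model parameter $\theta_0$. Expression (3.62) shows
clearly regularization's influence on the statistical properties of $\hat{\theta}^{R}$:

• when $\gamma = 0$, i.e., there is no regularization, $\hat{\theta}^{R}$ reduces to $\hat{\theta}^{LS}$ and $\mathrm{MSE}(\hat{\theta}^{R}, \theta_0)$
reduces to $\sigma^2 (\Phi^T \Phi)^{-1}$;

• when $\gamma > 0$, the regularized estimator $\hat{\theta}^{R}$ is biased and the MSE matrix of
$\hat{\theta}^{R}$ is decomposed into two components: the bias $\hat{\theta}^{R}_{\mathrm{bias}} (\hat{\theta}^{R}_{\mathrm{bias}})^T$ and the variance
$\mathrm{Cov}(\hat{\theta}^{R}, \hat{\theta}^{R})$. By a suitable choice of the regularization matrix $P$ and the regularization
parameter $\gamma$, the variance of $\hat{\theta}^{R}$ can be made "smaller" and, if the resulting
increase in the bias is moderate, an MSE matrix "smaller" than that associated to
LS can be obtained.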


3.4.2 Equivalent Degrees of Freedom 


For a given regularization matrix $P$, we have seen (also deriving the structure of the
MSE) that the regularization parameter $\gamma$ controls the influence of the regularization:
as $\gamma$ varies from 0 to $\infty$, the influence of the regularization $\theta^T P^{-1} \theta$ becomes stronger.
In particular, when $\gamma = 0$ there is no regularization and $\hat{\theta}^{R}$ reduces to $\hat{\theta}^{LS}$. When
$\gamma = \infty$ the regularization term $\gamma \theta^T P^{-1} \theta$ overwhelms the data fit $\|Y - \Phi\theta\|_2^2$ and
one has $\hat{\theta}^{R} = 0$.

Often, it is more convenient to exploit a normalized measure of the influence of
the regularization instead of considering directly the value of $\gamma$. For this goal, we
introduce the so-called influence or hat matrix:

$$H = \Phi P \Phi^T (\Phi P \Phi^T + \gamma I_N)^{-1}. \tag{3.63}$$

This matrix is important since it connects the measured output $Y$ with the predicted
output $\hat{Y} = \Phi \hat{\theta}^{R}$, i.e., one has

$$\hat{Y} = \Phi \hat{\theta}^{R} = H Y. \tag{3.64}$$

It is also important since its trace is indeed a normalized measure of the influence of
the regularization. To see this, let $A = \Phi P \Phi^T$ and consider its SVD

$$A = U D U^T,$$

where $U U^T = I_N$ and $D$ is a diagonal matrix with nonnegative entries $d_i^2$. Then,

$$H = U D U^T (U D U^T + \gamma U U^T)^{-1} = U D (D + \gamma I_N)^{-1} U^T.$$

Since $U$ is orthogonal, one has $\mathrm{trace}(U M U^T) = \mathrm{trace}(M)$, so that

$$\mathrm{trace}(H) = \mathrm{trace}(D (D + \gamma I_N)^{-1}) = \sum_{i=1}^{n} \frac{d_i^2}{d_i^2 + \gamma}.$$

The above equation implies that $\mathrm{trace}(H)$ is a monotonically decreasing function of
$\gamma$. It attains its maximum at $\gamma = 0$ and its infimum as $\gamma \to \infty$. In particular, for $\gamma = 0$
one has $\hat{\theta}^{R} = \hat{\theta}^{LS}$ and the hat matrix $H$ becomes $H = \Phi (\Phi^T \Phi)^{-1} \Phi^T$, implying
that $\mathrm{trace}(H) = n$ if $\Phi$ is full rank. For $\gamma \to \infty$ one instead has $\mathrm{trace}(H) \to 0$.
Therefore, it holds that $0 \leq \mathrm{trace}(H) \leq n$. Hence, since $n$ is the dimension of $\theta$, i.e.,
the number of parameters in the linear regression model, $\mathrm{trace}(H)$ can be seen as the
counterpart of the number of parameters to be estimated in the LS context. In other
words, in the regularized framework $\mathrm{trace}(H)$ plays the role of the model order. It
thus becomes natural to call it the equivalent degrees of freedom for the ReLS-Q
estimate $\hat{\theta}^{R}$, e.g., [21, Sect. 7.6], [4, p. 559]:

$$\mathrm{dof}(\hat{\theta}^{R}) = \mathrm{trace}(H). \tag{3.65}$$

The notation $\mathrm{dof}(\gamma)$ will be also used in the book in place of $\mathrm{dof}(\hat{\theta}^{R})$ to stress the
dependence of the equivalent degrees of freedom on the regularization parameter.

Fig. 3.7 Polynomial regression: true function $g(x)$ (blue line) and ridge regression estimates
obtained with 16 different values of the regularization parameter
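The behaviour of the equivalent degrees of freedom as a function of $\gamma$ can be illustrated numerically. The sketch below uses a random regression matrix and ridge regression ($P = I_n$); the sizes, the seed and the grid of $\gamma$ values are arbitrary.

```python
import numpy as np

def equivalent_dof(Phi, P, gamma):
    """Trace of the hat matrix H = A (A + gamma I_N)^{-1} with A = Phi P Phi^T, cf. (3.63), (3.65)."""
    N = Phi.shape[0]
    A = Phi @ P @ Phi.T
    # trace(A (A + gamma I)^{-1}) = trace((A + gamma I)^{-1} A) by cyclicity of the trace
    return float(np.trace(np.linalg.solve(A + gamma * np.eye(N), A)))

rng = np.random.default_rng(2)             # arbitrary seed (assumption)
N, n = 20, 5
Phi = rng.standard_normal((N, n))
P = np.eye(n)                              # ridge regression
gammas = [1e-6, 0.1, 1.0, 10.0, 1e6]
dofs = [equivalent_dof(Phi, P, g) for g in gammas]
```

The values in `dofs` decrease monotonically from nearly $n = 5$ (almost no regularization) to nearly 0 (very strong regularization), in agreement with $0 \leq \mathrm{trace}(H) \leq n$.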


Example 3.6 (Polynomial regression: ridge regression) As shown in Fig. 3.6, the
regression matrix $\Phi$ built in the polynomial regression example defined by (3.26) and (3.27)
is ill-conditioned for large $n$. Here, we consider the case $n = 16$ (corresponding to
a polynomial order 15) which leads to $\mathrm{cond}(\Phi) = 1.49 \times 10^{11}$. To illustrate how
ridge regression (3.61) can face the ill-conditioning, let $\gamma = \gamma_i$, $i = 1, \ldots, 16$, with
$\gamma_1 = 0.01$ and $\gamma_{16} = 0.31$ and $\gamma_2, \ldots, \gamma_{15}$ evenly spaced between $\gamma_1$ and $\gamma_{16}$. For
each $\gamma_i$, we then compute the corresponding ridge regression estimate (3.61) and plot
the 16 estimates $\hat{g}(x) = \phi(x)^T \hat{\theta}^{R}$ in Fig. 3.7. The fits (3.28) are shown in Fig. 3.8
as a function of $\gamma$. One can see that $\gamma = 0.11$ gives the best performance, obtaining
a fit around 89%. Interestingly, such fit is larger than the best result obtained by
LS through optimal tuning of the discrete model order, see Fig. 3.4. The base 10
logarithm of the condition number of $\Phi^T \Phi + \gamma I_n$, as a function of $\gamma$, is displayed
in Fig. 3.9. One can see that the matrix is much better conditioned now. Figure 3.10
plots the equivalent degrees of freedom of $\hat{\theta}^{R}$. Even if $n = 16$, the actual model
complexity in terms of equivalent degrees of freedom is much smaller, around 4 for
the tested values of $\gamma$. Finally, the estimates of any component of $\theta$ obtained using
the different values of $\gamma$ are shown in Fig. 3.11.

Fig. 3.8 Polynomial regression: profile of the ridge regression fit (3.28) as a function of $\gamma$. Large
fit values are associated to estimates close to the true function

Fig. 3.9 Polynomial regression: profile of the base 10 logarithm of the condition number of
$\Phi^T \Phi + \gamma I_n$ as a function of $\gamma$

Fig. 3.10 Polynomial regression: profile of the equivalent degrees of freedom (3.65) as a function
of $\gamma$ using ridge regression


3.4.2.1 Regularization Design: The Optimal Regularizer 


A natural question is how to design a regularization matrix $P$ and select $\gamma$ to obtain
a "good" model estimate. From a "classic" or "frequentist" point of view, rational
choices are those that make the MSE matrix (3.62d) small in some sense, as discussed
below. For our purposes, it is useful to rewrite the MSE matrix (3.62d) as follows:

$$\mathrm{MSE}(\hat{\theta}^{R}, \theta_0) = \sigma^2 \left(\frac{P \Phi^T \Phi}{\gamma} + I_n\right)^{-1} \left(\frac{P \Phi^T \Phi P}{\gamma^2} + \frac{\theta_0 \theta_0^T}{\sigma^2}\right) \left(\frac{\Phi^T \Phi P}{\gamma} + I_n\right)^{-1}. \tag{3.66}$$

Then, it is useful to first introduce the following lemma.


Lemma 3.1 (based on [9]) Consider the matrix

$$M(Q) = (QR + I)^{-1} (QRQ + Z) (RQ + I)^{-1},$$

where $Q$, $R$ and $Z$ are positive semidefinite matrices. Then for all $Q$

$$M(Z) \preceq M(Q), \tag{3.67}$$

which means that $M(Q) - M(Z)$ is positive semidefinite.


The proof consists of straightforward calculations and can be found in Sect. 3.8.2.

Fig. 3.11 Polynomial regression: profile of the estimates of each component forming the ridge
regression estimate (3.61). For each value $k = 0, \ldots, 15$ on the x-axis the plot reports the estimates
of the coefficient of the monomial $x^k$ obtained by using different values of the regularization
parameter $\gamma$

Using (3.66) and Lemma 3.1, the question of which $P$ and $\gamma$ give the best MSE of
$\hat{\theta}^{R}$ has a clear answer: the equation $\sigma^2 P = \gamma \theta_0 \theta_0^T$ needs to be satisfied. Thus, the
following result holds.


Proposition 3.1 (Optimal regularization for a given $\theta_0$, based on [9]) Letting $\gamma =
\sigma^2$, the regularization matrix

$$P = \theta_0 \theta_0^T \tag{3.68}$$

minimizes the MSE matrix (3.66) in the sense of (3.67).
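Proposition 3.1 can be probed numerically. The sketch below writes the ReLS-Q estimator as $\hat{\theta}^{R} = KY$ with $K = (P\Phi^T\Phi + \gamma I_n)^{-1} P \Phi^T$ from (3.58d), so that its MSE matrix is $\sigma^2 K K^T + (K\Phi - I_n)\theta_0\theta_0^T(K\Phi - I_n)^T$; this form does not need $P^{-1}$ and therefore also covers the rank-one choice $P = \theta_0\theta_0^T$. The data, the seed and the function name are illustrative assumptions.

```python
import numpy as np

def rels_mse(Phi, P, gamma, theta0, sigma2):
    """MSE matrix of the ReLS-Q estimator theta_hat = K Y, with K taken from (3.58d)."""
    n = Phi.shape[1]
    K = np.linalg.solve(P @ Phi.T @ Phi + gamma * np.eye(n), P @ Phi.T)
    B = K @ Phi - np.eye(n)                       # bias is B @ theta0
    return sigma2 * K @ K.T + B @ np.outer(theta0, theta0) @ B.T

rng = np.random.default_rng(3)                    # arbitrary seed (assumption)
N, n, sigma2 = 30, 4, 0.5
Phi = rng.standard_normal((N, n))
theta0 = np.array([1.0, -0.5, 0.25, 2.0])         # hypothetical true parameter

mse_opt = rels_mse(Phi, np.outer(theta0, theta0), sigma2, theta0, sigma2)  # P = theta0 theta0^T
mse_ridge = rels_mse(Phi, np.eye(n), sigma2, theta0, sigma2)               # P = I_n
mse_ls = sigma2 * np.linalg.inv(Phi.T @ Phi)                               # LS, cf. (3.18d)

# Smallest eigenvalues of the differences: nonnegative (up to rounding) if the
# differences are positive semidefinite, as stated by (3.67)
gap_ridge = np.linalg.eigvalsh(mse_ridge - mse_opt).min()
gap_ls = np.linalg.eigvalsh(mse_ls - mse_opt).min()
```

Both gaps are nonnegative up to rounding errors, so neither ridge regression with $\gamma = \sigma^2$ nor plain LS attains a smaller MSE matrix than the choice of Proposition 3.1. That choice is of course not implementable, since it requires $\theta_0$.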


Note that the MSE matrix (3.66) is linear in $\theta_0 \theta_0^T$. This means that if we compute
$\hat{\theta}^{R}$ with the same $P$ for a collection of true systems $\theta_0$, the average MSE over
that collection will be given by (3.66) with $\theta_0 \theta_0^T$ replaced by its average over the
collection. In particular, if $\theta_0$ is a random vector with $\mathcal{E}(\theta_0 \theta_0^T) = \Pi$, we obtain the
following result.

Proposition 3.2 (Optimal regularization for a random system $\theta_0$, based on [9]) Consider
(3.62d) with $\gamma = \sigma^2$. Then, the best average (expected) MSE for a random true
system $\theta_0$ with $\mathcal{E}(\theta_0 \theta_0^T) = \Pi$ is obtained by the regularization matrix $P = \Pi$.

Propositions 3.1 and 3.2 thus give a somewhat preliminary answer to our design
problem. Since the best regularization matrix $P = \theta_0 \theta_0^T$ depends on the true system
$\theta_0$, this formula cannot be used in practice. Nevertheless, it suggests choosing a
regularization matrix which mimics the behaviour of $\theta_0 \theta_0^T$. Using prior knowledge on
the true system $\theta_0$, this can be done by postulating a parametrized family of matrices
$P(\eta)$ with $\eta \in \Gamma \subset \mathbb{R}^m$, where $\eta$ is the so-called hyperparameter vector, $\Gamma$ is the
set where $\eta$ can vary and $m$ is the dimension of $\eta$. Thus, the choice of a parametrized
regularization matrix is similar to model structure selection in system identification.
The nature of the optimal regularizer suggests also to set

$$\gamma = \sigma^2. \tag{3.69}$$

However, the noise variance $\sigma^2$ is in general unknown and needs to be estimated
from the data. One can adopt equations (3.22) or (3.23). Another option is to include
$\sigma^2$ in $\eta$ and then estimate it together with the other hyperparameters.


3.5 Regularization Tuning for Quadratic Penalties 


3.5.1 Mean Squared Error and Expected Validation Error 


Now, assume that a parametrized family of regularization matrices $P(\eta)$ has been defined. The vector $\eta$ is in general unknown and has to be tuned by using the available measurements. The ReLS-Q estimate $\hat\theta^{\mathrm{R}}(\eta)$ in (3.58) depends on $\eta$, and the estimation strategy depends on the measure used to quantify its quality. We will consider the following two criteria:


• minimizing the MSE;
• minimizing the expected validation error (EVE).


3.5.1.1 Minimizing the MSE 


Still adopting a “classic” or “frequentist” point of view, a rational choice of $\eta$ is one that makes the MSE matrix (3.62d) small in some sense. For ease of estimation, a scalar measure is often exploited. In [25, Chap. 12], it is suggested to use a weighting matrix $Q$ and $\mathrm{trace}(\mathrm{MSE}(\hat\theta^{\mathrm{R}}(\eta), \theta_0)\, Q)$ as a quality measure of $\hat\theta^{\mathrm{R}}(\eta)$, where $Q$ reflects the intended use of the model $\hat\theta^{\mathrm{R}}(\eta)$. Then an estimate of $\eta$, say $\hat\eta$, is obtained as follows:

$$\hat\eta = \arg\min_{\eta \in \Gamma} \mathrm{trace}(\mathrm{MSE}(\hat\theta^{\mathrm{R}}(\eta), \theta_0)\, Q). \tag{3.70}$$

Note that (3.70) depends on the true system $\theta_0$, which is unknown, and thus cannot be used. In practice, we need to first find a “good” estimate, say $\hat\theta$, of the true system $\theta_0$ and then replace $\theta_0$ in (3.70) with $\hat\theta$. Then, hopefully, a “good” estimate is given by

$$\hat\eta = \arg\min_{\eta \in \Gamma} \mathrm{trace}(\mathrm{MSE}(\hat\theta^{\mathrm{R}}(\eta), \hat\theta)\, Q). \tag{3.71}$$

Different choices of $\hat\theta$ and $Q$ lead to different estimators (3.71). Examples are obtained by setting $\hat\theta$ to the LS estimate or to the ridge regression estimate of $\theta_0$, while the choice $Q = I_n$ is often used. In any case, the major difficulty underlying the idea of “minimizing the MSE” for hyperparameter tuning lies in whether or not $\hat\theta$ is a “good” estimate of $\theta_0$, which is actually our fundamental problem.
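The plug-in criterion (3.71) can be sketched numerically as follows. This is only an illustration under assumptions that are ours and not prescribed by the text: $Q = I_n$, the family $P(\eta) = \eta I_n$, $\gamma = \sigma^2$ known, the LS estimate as plug-in, and the MSE matrix written in the usual variance-plus-squared-bias form. The function names and the simulated data are hypothetical.

```python
import numpy as np

def mse_matrix(Phi, P, theta, sigma2):
    """MSE matrix of the ReLS-Q estimate (gamma = sigma2), as variance + bias * bias^T."""
    Pinv = np.linalg.inv(P)
    A = np.linalg.inv(Phi.T @ Phi + sigma2 * Pinv)
    bias = -sigma2 * A @ Pinv @ theta          # E[theta_hat] - theta
    variance = sigma2 * A @ Phi.T @ Phi @ A    # Cov(theta_hat)
    return variance + np.outer(bias, bias)

def tune_by_plugin_mse(Phi, Y, etas, sigma2):
    """Grid version of (3.71) with Q = I_n, P(eta) = eta * I_n and the LS estimate as plug-in."""
    n = Phi.shape[1]
    theta_ls = np.linalg.lstsq(Phi, Y, rcond=None)[0]
    scores = [np.trace(mse_matrix(Phi, eta * np.eye(n), theta_ls, sigma2)) for eta in etas]
    return etas[int(np.argmin(scores))], np.array(scores)

# Simulated data (assumed for illustration only)
rng = np.random.default_rng(0)
N, n, sigma2 = 40, 10, 1.0
Phi = rng.standard_normal((N, n))
theta0 = 0.3 * rng.standard_normal(n)
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)
etas = np.logspace(-3, 3, 31)
eta_hat, scores = tune_by_plugin_mse(Phi, Y, etas, sigma2)
```

The quality of `eta_hat` is only as good as the plug-in estimate, which is exactly the difficulty pointed out above.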


3.5.1.2 Minimizing the EVE 


An alternative quality measure of $\hat\theta^{\mathrm{R}}(\eta)$ is related to model prediction capability on independent validation data and is characterized by the expected validation error (EVE).

To define it, we need to introduce the training/estimation data and the validation data. The training data are used for estimating the model and are contained in the set $\mathcal{D}_T$. The validation data are used to assess model prediction capability and are in the set $\mathcal{D}_V$.

Now, let $\hat\theta^{\mathrm{R}}(\eta)$ denote a general ReLS-Q estimate parametrized by the vector $\eta$ and obtained using only the training data $\mathcal{D}_T$. Let $y_v \in \mathbb{R}$, $\phi_v \in \mathbb{R}^n$ be a validation sample pair. These objects could both be random, e.g., $y_v$ can be affected by noise and the regressor could be defined by a stochastic system input. The validation error $\mathrm{EVE}_{\mathcal{D}_T}(\eta)$ is then given by

$$\mathrm{EVE}_{\mathcal{D}_T}(\eta) = \mathcal{E}\big[(y_v - \phi_v^T\hat\theta^{\mathrm{R}}(\eta))^2 \,\big|\, \mathcal{D}_T\big]. \tag{3.72}$$

In the above equation, the expectation $\mathcal{E}$ is computed w.r.t. the joint distribution of $y_v$ and $\phi_v$ conditioned on the training data $\mathcal{D}_T$. If $\phi_v \in \mathbb{R}^n$ is deterministic and, as usual, $y_v$ is affected by a noise independent of those entering the training set, the mean is taken just w.r.t. such noise, with $\mathcal{D}_T$ influencing only $\hat\theta^{\mathrm{R}}$. In any case, the result is a function of the training set. Now, we can see $\mathcal{D}_T$ as random and then the EVE is

$$\mathrm{EVE}(\eta) = \mathcal{E}[\mathrm{EVE}_{\mathcal{D}_T}(\eta)], \tag{3.73}$$

where the expectation $\mathcal{E}$ is over the training set. Note that the final result is a function of the true $\theta_0$, which determines the probability distributions of the training and validation data.

The quantity $\mathrm{EVE}(\eta)$ measures the prediction capability of the model $\hat\theta^{\mathrm{R}}(\eta)$ before seeing any training or validation data: the smaller $\mathrm{EVE}(\eta)$, the better the expected model prediction capability. Therefore, it is natural to estimate $\eta$ as follows:

$$\hat\eta = \arg\min_{\eta \in \Gamma} \mathrm{EVE}(\eta). \tag{3.74}$$

However, as said, the above objective depends on the unknown vector $\theta_0$, so that estimation of $\eta$ in this way is not possible in practice. The problem is analogous to that encountered when trying to tune $\eta$ by minimizing the MSE.


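Remark 3.2 Interestingly, the idea of “minimizing the MSE” and the idea of “minimizing the EVE” are connected. To see this, we assume for simplicity that the regressors $\phi_i$, $i = 1, \ldots, N$, in the training data and $\phi_v$ in the validation data are deterministic. Then it can be shown that

$$\mathrm{EVE}(\eta) = \mathcal{E}\big[(y_v - \phi_v^T\hat\theta^{\mathrm{R}}(\eta))^2\big] = \sigma^2 + \phi_v^T\, \mathrm{MSE}(\hat\theta^{\mathrm{R}}(\eta), \theta_0)\, \phi_v, \tag{3.75}$$

where the expectation $\mathcal{E}$ is over everything that is random, and $\mathrm{MSE}(\hat\theta^{\mathrm{R}}(\eta), \theta_0)$ is the MSE matrix of $\hat\theta^{\mathrm{R}}(\eta)$ defined in (3.62d). Clearly, (3.75) shows that minimizing $\mathrm{EVE}(\eta)$ with respect to $\eta$ is equivalent to minimizing $\mathrm{trace}(\mathrm{MSE}(\hat\theta^{\mathrm{R}}(\eta), \theta_0)\, Q)$ with respect to $\eta$ when $Q = \phi_v\phi_v^T$.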
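The identity (3.75) follows from a short computation, sketched here under the assumptions of Remark 3.2 and assuming in addition that the validation output obeys $y_v = \phi_v^T\theta_0 + e_v$, with $e_v$ of zero mean and variance $\sigma^2$ and independent of the training noise (hence of $\hat\theta^{\mathrm{R}}(\eta)$):

```latex
\begin{aligned}
y_v - \phi_v^T\hat\theta^{\mathrm{R}}(\eta)
  &= e_v - \phi_v^T\big(\hat\theta^{\mathrm{R}}(\eta) - \theta_0\big), \\
\mathcal{E}\big[(y_v - \phi_v^T\hat\theta^{\mathrm{R}}(\eta))^2\big]
  &= \mathcal{E}[e_v^2]
     - 2\,\mathcal{E}[e_v]\,\phi_v^T\,\mathcal{E}\big[\hat\theta^{\mathrm{R}}(\eta) - \theta_0\big]
     + \phi_v^T\,\mathcal{E}\big[(\hat\theta^{\mathrm{R}}(\eta) - \theta_0)(\hat\theta^{\mathrm{R}}(\eta) - \theta_0)^T\big]\,\phi_v \\
  &= \sigma^2 + \phi_v^T\,\mathrm{MSE}(\hat\theta^{\mathrm{R}}(\eta), \theta_0)\,\phi_v
   = \sigma^2 + \mathrm{trace}\big(\mathrm{MSE}(\hat\theta^{\mathrm{R}}(\eta), \theta_0)\,\phi_v\phi_v^T\big).
\end{aligned}
```

The cross term vanishes because $e_v$ has zero mean and is independent of the estimate, and the last equality uses the cyclic property of the trace.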


To overcome the fact that the EVE depends on the unknown $\theta_0$, we could first find a “good” estimate of $\mathrm{EVE}(\eta)$ using the available data and then determine the hyperparameter vector by minimizing it. There are two ways to achieve this goal: by efficient sample reuse of the data and by considering the in-sample EVE instead. More details will be provided in the next two subsections.


3.5.2 Efficient Sample Reuse 


One way to estimate $\mathrm{EVE}(\eta)$ by exploiting efficient sample reuse includes cross-validation (CV) [41] and its variants already mentioned in Sects. 2.6.3 and 3.2.2 when discussing model order selection.


3.5.2.1 Hold Out Cross-Validation


The simplest CV is the so-called hold out CV (HOCV), which is widely used to select the model order for the classical PEM/ML. The HOCV can also be used to estimate the hyperparameter $\eta \in \Gamma$ for the ReLS-Q method.

The idea of hold out CV is to first split the given data into two parts: the training data $\mathcal{D}_T$ and the validation data $\mathcal{D}_V$. The prediction capability is measured in terms of the validation error. The model that gives the smallest validation error will be selected. More specifically, the HOCV takes the following three steps:


(1) Split the given data into two parts: $\mathcal{D}_T$ and $\mathcal{D}_V$.
(2) Estimate the model $\hat\theta^{\mathrm{R}}(\eta)$ based on $\mathcal{D}_T$ for different values of $\eta \in \Gamma$.
(3) Calculate the validation error for $\hat\theta^{\mathrm{R}}(\eta)$ over the validation data $\mathcal{D}_V$:

$$\mathrm{CV}(\eta) = \sum_{(y_v, \phi_v) \in \mathcal{D}_V} (y_v - \phi_v^T\hat\theta^{\mathrm{R}}(\eta))^2,$$

where the summation is over all pairs $(y_v, \phi_v)$ in the validation data $\mathcal{D}_V$. Then, select the value of $\eta$ that minimizes $\mathrm{CV}(\eta)$:

$$\hat\eta = \arg\min_{\eta \in \Gamma} \mathrm{CV}(\eta). \tag{3.76}$$

It is also possible to swap the roles of the training and validation sets in order to perform a second validation step: the model is estimated on the previous validation set and the validation error is computed on the previous training set. The final validation error is then obtained by averaging the two validation errors.
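The three HOCV steps can be sketched as follows. The sketch assumes, for illustration only, the ridge-type family $P(\eta) = \eta I_n$, a known noise variance, a finite grid of candidate values for $\eta$, and a split that uses the first `n_train` samples for training; all names and the simulated data are hypothetical.

```python
import numpy as np

def rels_q(Phi, Y, eta, sigma2):
    """ReLS-Q estimate with P(eta) = eta * I_n and gamma = sigma2 (ridge regression)."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + (sigma2 / eta) * np.eye(n), Phi.T @ Y)

def hold_out_cv(Phi, Y, etas, sigma2, n_train):
    """Return the grid value with the smallest validation error CV(eta), cf. (3.76)."""
    Phi_t, Y_t = Phi[:n_train], Y[:n_train]      # step (1): training data
    Phi_v, Y_v = Phi[n_train:], Y[n_train:]      # step (1): validation data
    cv = np.array([
        np.sum((Y_v - Phi_v @ rels_q(Phi_t, Y_t, eta, sigma2)) ** 2)  # steps (2)-(3)
        for eta in etas
    ])
    return etas[int(np.argmin(cv))], cv

# Simulated data (assumed for illustration only)
rng = np.random.default_rng(0)
N, n, sigma2 = 60, 8, 0.25
Phi = rng.standard_normal((N, n))
theta0 = rng.standard_normal(n)
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)
etas = np.logspace(-3, 3, 25)
eta_hat, cv = hold_out_cv(Phi, Y, etas, sigma2, n_train=40)
```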


3.5.2.2 k-Fold Cross-Validation 


The HOCV with swapped sets is a special case of the more general k-fold CV with $k = 2$, e.g., [24]. If the data set size is small, the HOCV may perform poorly. In fact, the training data may not be sufficiently rich to build good models and a validation set of small size may give a too uncertain validation error. In this case, the k-fold CV with $k > 2$ could be used.

The idea of k-fold CV is to first split the data into $k$ parts of equal size. For every $\eta \in \Gamma$, the following procedure is repeated $k$ times. At the $i$th run with $i = 1, 2, \ldots, k$:


(1) Retain the $i$th part as the validation data $\mathcal{D}_{V,i}$, and use the remaining $k - 1$ parts as the training data $\mathcal{D}_{T,-i}$.

(2) Estimate $\hat\theta^{\mathrm{R}}(\eta)$ based on the training data $\mathcal{D}_{T,-i}$ and then calculate the validation error over the validation data $\mathcal{D}_{V,i}$:

$$\mathrm{CV}_{-i}(\eta) = \sum_{(y_v, \phi_v) \in \mathcal{D}_{V,i}} (y_v - \phi_v^T\hat\theta^{\mathrm{R}}(\eta))^2,$$

where the summation is over all pairs $(y_v, \phi_v)$ in the validation data $\mathcal{D}_{V,i}$.


Finally, the $k$ validation errors $\mathrm{CV}_{-i}(\eta)$ so obtained are summed to obtain the following total validation error for $\eta$:

$$\mathrm{CV}(\eta) = \sum_{i=1}^{k} \mathrm{CV}_{-i}(\eta),$$

and the estimate of $\eta$ is finally given by

$$\hat\eta = \arg\min_{\eta \in \Gamma} \mathrm{CV}(\eta). \tag{3.77}$$
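A minimal sketch of the k-fold procedure is given below, again assuming the illustrative family $P(\eta) = \eta I_n$, a known noise variance and a grid search over $\eta$. The folds are taken as contiguous blocks of samples, which is one possible choice and not something the text prescribes; function names and data are hypothetical.

```python
import numpy as np

def rels_q(Phi, Y, eta, sigma2):
    """ReLS-Q estimate with P(eta) = eta * I_n and gamma = sigma2 (ridge regression)."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + (sigma2 / eta) * np.eye(n), Phi.T @ Y)

def k_fold_cv(Phi, Y, eta, sigma2, k):
    """Total validation error CV(eta): sum of the k fold errors CV_{-i}(eta)."""
    all_idx = np.arange(len(Y))
    total = 0.0
    for val_idx in np.array_split(all_idx, k):          # i-th part = validation data
        train_idx = np.setdiff1d(all_idx, val_idx)       # remaining k-1 parts = training data
        theta = rels_q(Phi[train_idx], Y[train_idx], eta, sigma2)
        total += np.sum((Y[val_idx] - Phi[val_idx] @ theta) ** 2)
    return total

# Simulated data (assumed for illustration only)
rng = np.random.default_rng(1)
N, n, sigma2 = 60, 8, 0.25
Phi = rng.standard_normal((N, n))
theta0 = rng.standard_normal(n)
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)
etas = np.logspace(-3, 3, 25)
cv = np.array([k_fold_cv(Phi, Y, eta, sigma2, k=5) for eta in etas])
eta_hat = etas[int(np.argmin(cv))]                       # grid version of (3.77)
```

With `k=2` the function returns the sum of the two swapped hold-out errors, in line with the remark that HOCV with swapped sets is a special case.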


3.5.2.3 Predicted Residual Error Sum of Squares and Variants 


The computation of the k-fold CV is often expensive. An exception is the leave-one-out CV (LOOCV), where each validation set includes only one validation pair (i.e., $k = N$). When the square loss function is used, the total validation error admits a closed-form expression and the LOOCV is also known as the predicted residual error sum of squares (PRESS), e.g., [2].

First, recall the linear regression model (3.13) and the corresponding data $y_i \in \mathbb{R}$ and $\phi_i \in \mathbb{R}^n$ for $i = 1, \ldots, N$. Then the ReLS-Q estimate is

$$\begin{aligned}
\hat\theta^{\mathrm{R}} &= \arg\min_{\theta} \|Y - \Phi\theta\|_2^2 + \sigma^2\theta^T P^{-1}(\eta)\theta \\
&= (\Phi^T\Phi + \sigma^2 P^{-1}(\eta))^{-1}\Phi^T Y \\
&= \Big(\sum_{i=1}^{N} \phi_i\phi_i^T + \sigma^2 P^{-1}(\eta)\Big)^{-1} \sum_{i=1}^{N} \phi_i y_i,
\end{aligned} \tag{3.78}$$

where we have set $\gamma = \sigma^2$ following (3.69). For the $k$th measured output $y_k$, the corresponding predicted output $\hat y_k$ and residual $r_k$ are, respectively,

$$\hat y_k = \phi_k^T \Big(\sum_{i=1}^{N} \phi_i\phi_i^T + \sigma^2 P^{-1}(\eta)\Big)^{-1} \sum_{i=1}^{N} \phi_i y_i, \tag{3.79a}$$

$$r_k = y_k - \hat y_k. \tag{3.79b}$$

Then, PRESS selects the value of $\eta \in \Gamma$ that minimizes the sum of squares of the validation errors. One can prove that this corresponds to the following problem:

$$\text{PRESS:} \quad \hat\eta = \arg\min_{\eta \in \Gamma} \sum_{k=1}^{N} \frac{r_k^2}{(1 - \phi_k^T M^{-1}\phi_k)^2}, \tag{3.80}$$

where the $r_k$ are defined by (3.79) while

$$M = \sum_{i=1}^{N} \phi_i\phi_i^T + \sigma^2 P^{-1}(\eta). \tag{3.81}$$

The derivation of (3.80) can be found in Sect. 3.8.3. It is worth noting that the denominator in (3.80) is strictly related to the diagonal entries of the hat matrix $H$ defined in (3.63). In fact,

$$\phi_k^T M^{-1}\phi_k = H_{kk},$$

so that

$$\text{PRESS:} \quad \hat\eta = \arg\min_{\eta \in \Gamma} \sum_{k=1}^{N} \frac{r_k^2}{(1 - H_{kk})^2}.$$

Hence, interestingly, one can conclude that evaluating PRESS only requires computing the ReLS-Q estimate on the full data set (instead of solving $N$ problems, one for each missing measurement in the training set).
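The closed form (3.80) can be compared numerically with a brute-force leave-one-out loop. The sketch below is only an illustration: the regularization matrix `P`, the noise variance and the simulated data are assumptions, and the function names are hypothetical.

```python
import numpy as np

def press(Phi, Y, P, sigma2):
    """PRESS score (3.80): sum_k r_k^2 / (1 - H_kk)^2, from a single full-data fit."""
    M = Phi.T @ Phi + sigma2 * np.linalg.inv(P)       # matrix M in (3.81)
    H = Phi @ np.linalg.solve(M, Phi.T)               # hat matrix, H_kk = phi_k^T M^{-1} phi_k
    r = Y - H @ Y                                     # residuals (3.79b)
    return np.sum((r / (1.0 - np.diag(H))) ** 2)

def loocv_brute_force(Phi, Y, P, sigma2):
    """Same score computed by solving N leave-one-out ReLS-Q problems."""
    Pinv = np.linalg.inv(P)
    total = 0.0
    for k in range(len(Y)):
        keep = np.arange(len(Y)) != k                 # drop the k-th measurement
        M_k = Phi[keep].T @ Phi[keep] + sigma2 * Pinv
        theta_k = np.linalg.solve(M_k, Phi[keep].T @ Y[keep])
        total += (Y[k] - Phi[k] @ theta_k) ** 2       # validation error on the left-out pair
    return total

# Simulated data (assumed for illustration only)
rng = np.random.default_rng(1)
N, n, sigma2 = 30, 5, 0.5
Phi = rng.standard_normal((N, n))
Y = Phi @ rng.standard_normal(n) + np.sqrt(sigma2) * rng.standard_normal(N)
P = np.diag(0.8 ** np.arange(n))                      # an arbitrary positive definite choice
score_closed_form = press(Phi, Y, P, sigma2)
score_brute_force = loocv_brute_force(Phi, Y, P, sigma2)
```

The two scores agree up to rounding, while the closed form needs one linear solve instead of $N$.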

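One method that is closely related to PRESS is the so-called generalized cross-validation (GCV), e.g., [18]. GCV is obtained by replacing in (3.80) the factors $H_{kk}$ by their average, i.e., $\mathrm{trace}(H)/N$:

$$\text{GCV:} \quad \hat\eta = \arg\min_{\eta \in \Gamma} \sum_{k=1}^{N} \frac{r_k^2}{(1 - \mathrm{trace}(H)/N)^2}. \tag{3.82}$$

Recalling (3.65), the term $\mathrm{trace}(H)$ defines the degrees of freedom of $\hat\theta^{\mathrm{R}}$. Hence, the GCV criterion can be rewritten as follows:

$$\text{GCV:} \quad \hat\eta = \arg\min_{\eta \in \Gamma} \frac{1}{(1 - \mathrm{dof}(\hat\theta^{\mathrm{R}}(\eta))/N)^2} \sum_{k=1}^{N} r_k^2.$$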
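A possible numerical sketch of the GCV score (3.82) is reported below. It assumes the illustrative family $P(\eta) = \eta I_n$, a known noise variance and a grid of candidate values; function names and data are hypothetical.

```python
import numpy as np

def hat_matrix(Phi, P, sigma2):
    """Hat matrix H of the ReLS-Q estimate with gamma = sigma2."""
    M = Phi.T @ Phi + sigma2 * np.linalg.inv(P)
    return Phi @ np.linalg.solve(M, Phi.T)

def gcv(Phi, Y, P, sigma2):
    """GCV score (3.82): residual sum of squares divided by (1 - dof / N)^2."""
    H = hat_matrix(Phi, P, sigma2)
    dof = np.trace(H)                       # equivalent degrees of freedom, cf. (3.65)
    rss = np.sum((Y - H @ Y) ** 2)
    return rss / (1.0 - dof / len(Y)) ** 2

# Simulated data (assumed for illustration only)
rng = np.random.default_rng(2)
N, n, sigma2 = 50, 10, 0.5
Phi = rng.standard_normal((N, n))
Y = Phi @ rng.standard_normal(n) + np.sqrt(sigma2) * rng.standard_normal(N)
etas = np.logspace(-3, 3, 25)
scores = np.array([gcv(Phi, Y, eta * np.eye(n), sigma2) for eta in etas])
eta_hat = etas[int(np.argmin(scores))]
```

When all diagonal entries $H_{kk}$ coincide, the GCV and PRESS scores are identical; otherwise GCV can be seen as a rotation-invariant surrogate of PRESS.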


3.5.3 Expected In-Sample Validation Error 


In the definition of the validation error $\mathrm{EVE}_{\mathcal{D}_T}$ (3.72), reported for convenience also below,

$$\mathrm{EVE}_{\mathcal{D}_T}(\eta) = \mathcal{E}\big[(y_v - \phi_v^T\hat\theta^{\mathrm{R}}(\eta))^2 \,\big|\, \mathcal{D}_T\big],$$

we assumed that the conditional expectation $\mathcal{E}$ is over the independent validation sample pair $y_v \in \mathbb{R}$, $\phi_v \in \mathbb{R}^n$, which are drawn randomly from their joint distribution. The computation of the validation error (3.72) could become easier if independent validation sample pairs $y_v \in \mathbb{R}$, $\phi_v \in \mathbb{R}^n$ are generated in a particular way.

For linear regression problems, it is convenient to assume that the same deterministic regressors $\phi_i$, $i = 1, 2, \ldots, N$, are used for generating both the training data and the validation data. To be specific, still using $\theta_0$ to denote the true parameter vector, we recall from (3.6) that the training output samples are

$$y_i = \phi_i^T\theta_0 + e_i, \quad i = 1, \ldots, N. \tag{3.83}$$

In this case, the training set is

$$\mathcal{D}_T = \{(y_i, \phi_i) \mid y_i \in \mathbb{R},\ \phi_i \in \mathbb{R}^n \text{ satisfying (3.83)},\ i = 1, \ldots, N\}. \tag{3.84}$$

Using the same regressors $\phi_i$, consider a set of validation output samples $y_{v,i}$ as follows:

$$y_{v,i} = \phi_i^T\theta_0 + e_{v,i}, \quad i = 1, \ldots, N, \tag{3.85}$$

where $\theta_0$ is the true parameter vector, with the noises $e_i$ and $e_{v,i}$ assumed independent and identically distributed. The validation error is now denoted by $\mathrm{EVE}_{\mathrm{in},\mathcal{D}_T}(\eta)$, computed as follows:

$$\mathrm{EVE}_{\mathrm{in},\mathcal{D}_T}(\eta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{E}\big[(y_{v,i} - \phi_i^T\hat\theta^{\mathrm{R}}(\eta))^2 \,\big|\, \mathcal{D}_T\big], \tag{3.86}$$

and called in-sample validation error [21, p. 228]. Note that, similarly to what was discussed after (3.72), the expectation $\mathcal{E}$ in (3.86) is computed w.r.t. the joint distribution of the couples $y_{v,i}, \phi_i$ conditioned on the training data $\mathcal{D}_T$. Thus, the result is a function of the training set. As done in (3.73), we can remove such dependence by computing the expected in-sample validation error as

$$\mathrm{EVE}_{\mathrm{in}}(\eta) = \mathcal{E}[\mathrm{EVE}_{\mathrm{in},\mathcal{D}_T}(\eta)], \tag{3.87}$$

with expectation taken over the joint distribution of the training data. In what follows, we will see how to build an unbiased estimator of $\mathrm{EVE}_{\mathrm{in}}(\eta)$ using the training data (3.84), and how to exploit it for hyperparameter tuning.


3.5.3.1 Expectation of the Sum of Squared Residuals, Optimism 
and Degrees of Freedom 


To estimate $\mathrm{EVE}_{\mathrm{in}}(\eta)$, consider the sum of squared residuals

$$\overline{\mathrm{err}}_{\mathcal{D}_T}(\eta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \phi_i^T\hat\theta^{\mathrm{R}}(\eta))^2, \tag{3.88}$$

which is a function only of the training set. Its expectation w.r.t. the training data (3.84) is

$$\overline{\mathrm{err}}(\eta) = \mathcal{E}\Big(\frac{1}{N} \sum_{i=1}^{N} (y_i - \phi_i^T\hat\theta^{\mathrm{R}}(\eta))^2\Big). \tag{3.89}$$

One expects $\mathrm{EVE}_{\mathrm{in}}(\eta)$ to be not smaller than $\overline{\mathrm{err}}(\eta)$ because the latter quantity exploits the same data to fit the model and to assess the error. This intuition is indeed true, as shown in the following theorem whose proof is in Sect. 3.8.4.


Theorem 3.7 Consider the linear regression model (3.13) with the training data (3.84), the validation data (3.85) and the ReLS-Q estimate (3.58). Then it holds that

$$\overline{\mathrm{err}}(\eta) \le \mathrm{EVE}_{\mathrm{in}}(\eta). \tag{3.90}$$

Theorem 3.7 shows that the expectation of the sum of squares of the residuals is an overly optimistic estimator of the expected in-sample validation error $\mathrm{EVE}_{\mathrm{in}}(\eta)$. The difference between $\mathrm{EVE}_{\mathrm{in}}(\eta)$ and $\overline{\mathrm{err}}(\eta)$ is called the optimism in statistics. In particular, one has, see, e.g., [21, p. 229]:

$$\mathrm{EVE}_{\mathrm{in}}(\eta) = \overline{\mathrm{err}}(\eta) + \mathrm{optimism}(\eta), \tag{3.91}$$

where, rewriting (3.83) as

$$Y = \Phi\theta_0 + E, \tag{3.92}$$

and defining the output prediction as

$$\hat Y(\eta) = \Phi\hat\theta^{\mathrm{R}}(\eta),$$

it holds that

$$\mathrm{optimism}(\eta) = \frac{2}{N}\, \mathrm{trace}(\mathrm{Cov}(Y, \hat Y(\eta))) \ge 0. \tag{3.93}$$

Combining arguments contained in the proof of Theorem 3.7 reported in the appendix to this chapter, see, in particular, (3.164), with the definition of equivalent degrees of freedom in (3.65), one obtains that

$$\mathrm{trace}(\mathrm{Cov}(Y, \hat Y(\eta))) = \sigma^2\, \mathrm{dof}(\hat\theta^{\mathrm{R}}(\eta)). \tag{3.94}$$

This reveals the deep connection between the optimism and the equivalent degrees of freedom.


3.5.3.2 An Unbiased Estimator of the Expected In-Sample Validation 
Error 


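Exploiting (3.94), we can now rewrite (3.91) as

$$\mathrm{EVE}_{\mathrm{in}}(\eta) = \overline{\mathrm{err}}(\eta) + 2\sigma^2\, \frac{\mathrm{dof}(\hat\theta^{\mathrm{R}}(\eta))}{N}. \tag{3.95}$$

Interestingly, on the left-hand side of (3.95), $\mathrm{EVE}_{\mathrm{in}}(\eta)$, by definition (3.87), is the mean of a random variable which depends on both the training data (3.84) and the validation data (3.85). Instead, on the right-hand side of (3.95), $\overline{\mathrm{err}}(\eta)$ is the expectation of a random variable which depends only on the training data. Hence, an unbiased estimator $\widehat{\mathrm{EVE}}_{\mathrm{in}}(\eta)$ of $\mathrm{EVE}_{\mathrm{in}}(\eta)$ is obtained by just replacing $\overline{\mathrm{err}}(\eta)$ with $\overline{\mathrm{err}}_{\mathcal{D}_T}(\eta)$, reported in (3.88). One thus obtains

$$\begin{aligned}
\widehat{\mathrm{EVE}}_{\mathrm{in}}(\eta) &= \overline{\mathrm{err}}_{\mathcal{D}_T}(\eta) + 2\sigma^2\, \frac{\mathrm{dof}(\hat\theta^{\mathrm{R}}(\eta))}{N} \\
&= \frac{1}{N} \|Y - \Phi\hat\theta^{\mathrm{R}}(\eta)\|_2^2 + 2\sigma^2\, \frac{\mathrm{dof}(\hat\theta^{\mathrm{R}}(\eta))}{N}.
\end{aligned} \tag{3.96}$$

So, after observing the training data (3.84), the hyperparameter $\eta$ can be estimated as follows:

$$\hat\eta = \arg\min_{\eta \in \Gamma} \frac{1}{N} \|Y - \Phi\hat\theta^{\mathrm{R}}(\eta)\|_2^2 + 2\sigma^2\, \frac{\mathrm{dof}(\hat\theta^{\mathrm{R}}(\eta))}{N}. \tag{3.97}$$

The hyperparameter estimation criterion (3.97) has different names in statistics: it is known as the $C_p$ statistic, e.g., [27], and as Stein’s unbiased risk estimator (SURE), e.g., [40].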
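The criterion (3.97) is straightforward to evaluate once the hat matrix is available. The following sketch assumes, for illustration only, a known noise variance, the family $P(\eta) = \eta I_n$ and a grid search; function names and the simulated data are hypothetical.

```python
import numpy as np

def sure_score(Phi, Y, P, sigma2):
    """Criterion in (3.97): mean squared residual plus 2 * sigma2 * dof / N."""
    N = len(Y)
    M = Phi.T @ Phi + sigma2 * np.linalg.inv(P)
    H = Phi @ np.linalg.solve(M, Phi.T)            # hat matrix, Y_hat = H Y
    dof = np.trace(H)                              # equivalent degrees of freedom
    return np.sum((Y - H @ Y) ** 2) / N + 2.0 * sigma2 * dof / N

# Simulated data (assumed for illustration only)
rng = np.random.default_rng(3)
N, n, sigma2 = 50, 12, 0.5
Phi = rng.standard_normal((N, n))
theta0 = rng.standard_normal(n) * 0.5 ** np.arange(n)
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)
etas = np.logspace(-3, 3, 25)
scores = np.array([sure_score(Phi, Y, eta * np.eye(n), sigma2) for eta in etas])
eta_hat = etas[int(np.argmin(scores))]             # grid version of (3.97)
```

As the regularization vanishes (very large `P`), the degrees of freedom approach $n$ and the score reduces to the LS residual term plus $2\sigma^2 n/N$.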

Interestingly, as will be clear from the proof of Theorem 3.7, the above formula (3.97) still provides an unbiased prediction risk estimator also if we replace $\Phi\theta_0$ in (3.92) with a generic vector $\mu$ s.t. $Y = \mu + E$. Hence, one does not need to assume the existence of the true $\theta_0$ and of a regression matrix which describes the linear input-output relation. A variant of the expected in-sample validation error is also discussed in Sect. 3.8.5.


3.5.3.3 Excess Degrees of Freedom*


In the previous subsection, we have discussed how to construct an unbiased estimator of the expected in-sample validation error, see (3.96), and how to use it for hyperparameter tuning, see (3.97). Irrespective of the particular method adopted for hyperparameter estimation, the estimate $\hat\eta$ of $\eta$ depends on the data $Y$, with the regression matrix $\Phi$ here assumed deterministic and known. We stress this by writing

$$\hat\eta = \hat\eta(Y).$$

Accordingly, the ReLS-Q estimate (3.58) with $\eta$ replaced by $\hat\eta(Y)$ becomes

$$\hat\theta^{\mathrm{R}}(\hat\eta(Y)) = (\Phi^T\Phi + \sigma^2 P^{-1}(\hat\eta(Y)))^{-1}\Phi^T Y. \tag{3.98}$$

Since $\hat\eta$ is a random vector, to design a truly unbiased estimator of the expected in-sample validation error of $\hat\theta^{\mathrm{R}}(\hat\eta(Y))$ one should not use (3.96), since it assumes the hyperparameter $\eta$ constant.

In what follows, we will derive an unbiased estimator of the expected in-sample validation error of $\hat\theta^{\mathrm{R}}(\hat\eta(Y))$. Such an estimator will thus be able to account also for the price of estimating model complexity (the degrees of freedom) from data. To this goal, we need the following version of Stein’s Lemma [40], a simplified version of which was already introduced in Chap. 1.


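Lemma 3.2 (Stein’s Lemma, adapted from [40]) Consider the following additive measurement model:

$$x = \mu + \varepsilon,$$

where $x$, $\mu$ and $\varepsilon$ are real vectors of the same dimension, $\mu$ is an unknown constant vector and $\varepsilon \sim \mathcal{N}(0, \Sigma)$. Let $\hat\mu(x)$ be an estimator of $\mu$ based on the data $x$ such that $\mathrm{Cov}(\hat\mu(x), x)$ and $\mathcal{E}\big(\frac{\partial \hat\mu(x)}{\partial x}\big)$ exist. Then

$$\mathrm{Cov}(\hat\mu(x), x) = \mathcal{E}\Big(\frac{\partial \hat\mu(x)}{\partial x}\Big)\Sigma.$$

Let

$$Y_v = \begin{bmatrix} y_{v,1} \\ y_{v,2} \\ \vdots \\ y_{v,N} \end{bmatrix}, \quad E_v = \begin{bmatrix} e_{v,1} \\ e_{v,2} \\ \vdots \\ e_{v,N} \end{bmatrix}, \tag{3.99}$$

so that (3.85) can be rewritten as

$$Y_v = \Phi\theta_0 + E_v. \tag{3.100}$$

Now, let us consider the measurement model (3.92) and the validation data (3.100), assuming also that

$$E \sim \mathcal{N}(0, \sigma^2 I_N), \quad E_v \sim \mathcal{N}(0, \sigma^2 I_N). \tag{3.101}$$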


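Then, using the correspondences

$$x = Y, \quad \mu = \Phi\theta_0, \quad \varepsilon = E, \quad \Sigma = \sigma^2 I_N,$$

$$\hat\mu(x) = f(Y, \hat\eta) = \Phi\hat\theta^{\mathrm{R}}(\hat\eta(Y)) = \Phi(\Phi^T\Phi + \sigma^2 P^{-1}(\hat\eta(Y)))^{-1}\Phi^T Y,$$

together with (3.161) in the appendix to this chapter, one can prove that

$$\underbrace{\mathcal{E}\Big[\mathcal{E}\Big[\frac{1}{N}\|Y_v - \Phi\hat\theta^{\mathrm{R}}(\hat\eta(Y))\|_2^2 \,\Big|\, \mathcal{D}_T\Big]\Big]}_{\mathrm{EVE}_{\mathrm{in}}(\hat\eta)} - \underbrace{\mathcal{E}\Big[\frac{1}{N}\|Y - \Phi\hat\theta^{\mathrm{R}}(\hat\eta(Y))\|_2^2\Big]}_{\overline{\mathrm{err}}(\hat\eta)} = \frac{2}{N}\, \mathrm{trace}(\mathrm{Cov}(Y, \Phi\hat\theta^{\mathrm{R}}(\hat\eta(Y)))).$$

Using Stein’s Lemma and the chain rule, one has (the trace is unaffected by the order of the two arguments of the covariance)

$$\begin{aligned}
\mathrm{Cov}(\Phi\hat\theta^{\mathrm{R}}(\hat\eta(Y)), Y) &= \sigma^2\, \mathcal{E}\Big[\frac{d f(Y, \hat\eta(Y))}{d Y^T}\Big] \\
&= \sigma^2\, \mathcal{E}\Big[\frac{\partial f(Y, \hat\eta)}{\partial Y^T}\Big] + \sigma^2\, \mathcal{E}\Big[\frac{\partial f(Y, \hat\eta)}{\partial \hat\eta^T} \frac{\partial \hat\eta}{\partial Y^T}\Big] \\
&= \sigma^2\, \mathcal{E}\big[\Phi(\Phi^T\Phi + \sigma^2 P^{-1}(\hat\eta(Y)))^{-1}\Phi^T\big] + \sigma^2\, \mathcal{E}\Big[\frac{\partial f(Y, \hat\eta)}{\partial \hat\eta^T} \frac{\partial \hat\eta}{\partial Y^T}\Big].
\end{aligned}$$

Therefore, it holds that

$$\begin{aligned}
\mathrm{EVE}_{\mathrm{in}}(\hat\eta) &= \overline{\mathrm{err}}(\hat\eta) + 2\sigma^2\, \frac{1}{N}\, \mathcal{E}\big[\mathrm{trace}(\Phi(\Phi^T\Phi + \sigma^2 P^{-1}(\hat\eta(Y)))^{-1}\Phi^T)\big] + 2\sigma^2\, \frac{1}{N}\, \mathrm{trace}\Big(\mathcal{E}\Big[\frac{\partial f(Y, \hat\eta)}{\partial \hat\eta^T} \frac{\partial \hat\eta}{\partial Y^T}\Big]\Big) \\
&= \overline{\mathrm{err}}(\hat\eta) + 2\sigma^2\, \frac{\mathcal{E}[\mathrm{dof}(\hat\theta^{\mathrm{R}}(\hat\eta(Y)))]}{N} + 2\sigma^2\, \frac{1}{N}\, \mathrm{trace}\Big(\mathcal{E}\Big[\frac{\partial f(Y, \hat\eta)}{\partial \hat\eta^T} \frac{\partial \hat\eta}{\partial Y^T}\Big]\Big).
\end{aligned} \tag{3.102}$$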


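If $\hat\eta = \hat\eta(Y)$ were independent of $Y$, the above objective would coincide with the SURE score reported in (3.97). The difference is instead the presence of the term $2\sigma^2\, \frac{1}{N}\, \mathrm{trace}\big(\mathcal{E}\big[\frac{\partial f(Y, \hat\eta)}{\partial \hat\eta^T} \frac{\partial \hat\eta}{\partial Y^T}\big]\big)$. It represents the extra optimism induced by the estimation of $\eta$ and is due to the randomness of the data $Y$ entering the hyperparameter estimator. The term $\mathrm{trace}\big(\mathcal{E}\big[\frac{\partial f(Y, \hat\eta)}{\partial \hat\eta^T} \frac{\partial \hat\eta}{\partial Y^T}\big]\big)$ is called the excess degrees of freedom [33] and denoted by

$$\mathrm{exdof}(\hat\theta^{\mathrm{R}}(\hat\eta(Y))) = \mathrm{trace}\Big(\mathcal{E}\Big[\frac{\partial f(Y, \hat\eta)}{\partial \hat\eta^T} \frac{\partial \hat\eta}{\partial Y^T}\Big]\Big). \tag{3.103}$$

From (3.102), we readily obtain an unbiased estimator of $\mathrm{EVE}_{\mathrm{in}}$ as follows:

$$\begin{aligned}
\widehat{\mathrm{EVE}}_{\mathrm{in}} &= \overline{\mathrm{err}}_{\mathcal{D}_T}(\hat\eta) + 2\sigma^2\, \frac{\mathrm{dof}(\hat\theta^{\mathrm{R}}(\hat\eta(Y)))}{N} + 2\sigma^2\, \frac{\widehat{\mathrm{exdof}}(\hat Y(\hat\eta))}{N} \\
&= \frac{1}{N}\|Y - \Phi\hat\theta^{\mathrm{R}}(\hat\eta(Y))\|_2^2 + 2\sigma^2\, \frac{\mathrm{dof}(\hat\theta^{\mathrm{R}}(\hat\eta(Y)))}{N} + 2\sigma^2\, \frac{1}{N}\, \mathrm{trace}\Big(\frac{\partial f(Y, \hat\eta)}{\partial \hat\eta^T} \frac{\partial \hat\eta}{\partial Y^T}\Big),
\end{aligned} \tag{3.104}$$

where $\widehat{\mathrm{exdof}}(\hat Y(\hat\eta))$ is an unbiased estimator of $\mathrm{exdof}(\hat Y(\hat\eta))$. As discussed in [33], (3.104) can be used to compare different regularized estimators also in terms of the different complexity of the hyperparameter tuning strategies that they adopt.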


3.6 Regularized Least Squares with Other Types of Regularizers*


The general ReLS criterion assumes the following form:

$$\hat\theta^{\mathrm{R}} = \arg\min_{\theta} \|Y - \Phi\theta\|_2^2 + \gamma J(\theta).$$

The different choices of the regularization term $J(\theta)$ depend on the prior knowledge regarding $\theta_0$. Having discussed the quadratic penalty, we will now consider two other important choices for $J(\theta)$, given by the $\ell_1$-norm and the nuclear norm.


3.6.1 $\ell_1$-Norm Regularization


ReLS with $\ell_1$-norm regularization leads to

$$\hat\theta^{\mathrm{R}} = \arg\min_{\theta} \|Y - \Phi\theta\|_2^2 + \gamma\|\theta\|_1, \tag{3.105}$$

where $\|\theta\|_1$ represents the $\ell_1$-norm of $\theta$, i.e., $\|\theta\|_1 = \sum_{i=1}^{n} |\theta_i|$ with $\theta_i$ being the $i$th element of $\theta$. The problem (3.105) is also known as the least absolute shrinkage and selection operator (LASSO) [42] and is equivalently defined as follows:

$$\arg\min_{\theta} \|Y - \Phi\theta\|_2^2, \quad \text{subj. to } \|\theta\|_1 \le \beta, \tag{3.106}$$

where $\beta > 0$ is a tuning parameter connected with $\gamma$ that controls the sparsity of $\theta$.


3.6.1.1 Computation of Sparse Solutions


LASSO (3.105) has been widely used for finding sparse solutions. In signal processing, this problem has wide applications in compressive sensing for finding sparse signal representations from redundant dictionaries. In machine learning and statistics, the problem has also been applied extensively for variable selection, where the aim is to select a subset of relevant variables to use in model construction.

Recall that a vector $\theta \in \mathbb{R}^n$ is said to be sparse if $\|\theta\|_0 \ll n$, where $\|\theta\|_0$ is the $\ell_0$ norm of $\theta$, which counts the number of nonzero elements of $\theta$. For linear regression models, sparse estimation requires finding a sparse $\theta$ able to fit the data well, i.e., such that $\|Y - \Phi\theta\|_2^2$ is small. More formally, the problem is defined as follows:

$$\min_{\theta} \|\theta\|_0, \quad \text{subj. to } \|Y - \Phi\theta\|_2^2 \le \varepsilon, \tag{3.107}$$

where $Y \in \mathbb{R}^N$, $\theta \in \mathbb{R}^n$ with $n > N$, $\Phi \in \mathbb{R}^{N \times n}$ assumed of full rank, i.e., $\mathrm{rank}(\Phi) = N$, and $\varepsilon > 0$ is a tuning parameter that controls the data fit.

The problem (3.107) is known to be NP-hard, e.g., [31]. It is combinatorial and finding its solution requires an exhaustive search. Hence, one needs approximate methods. The most popular technique relies on a convex relaxation of (3.107) obtained by replacing the $\ell_0$-norm with the $\ell_1$-norm:

$$\min_{\theta} \|\theta\|_1, \quad \text{subj. to } \|Y - \Phi\theta\|_2^2 \le \varepsilon. \tag{3.108}$$

By using the method of Lagrange multipliers, it can be shown that the convex relaxation (3.108) is equivalent to LASSO (3.105).

A natural question is whether or not the solution of LASSO (3.105) can be sparse. The answer is affirmative. For illustration, we first show this feature when the regression matrix $\Phi$ is orthogonal, assuming $N = n$.


3.6.1.2 LASSO Using an Orthogonal Regression Matrix 


Let us consider (3.105) with orthogonal regression matrix $\Phi$, i.e., $\Phi^T\Phi = \Phi\Phi^T = I_n$. Then (3.105) is rearranged as follows:

$$\begin{aligned}
\hat\theta^{\mathrm{R}} &= \arg\min_{\theta} \|(\Phi^T\Phi)^{-1}\Phi^T(Y - \Phi\theta)\|_2^2 + \gamma\|\theta\|_1 \\
&= \arg\min_{\theta} \|\hat\theta^{\mathrm{LS}} - \theta\|_2^2 + \gamma\|\theta\|_1 \\
&= \arg\min_{\theta} \sum_{i=1}^{n} (\hat\theta_i^{\mathrm{LS}} - \theta_i)^2 + \gamma|\theta_i|,
\end{aligned} \tag{3.109}$$

where $\hat\theta_i^{\mathrm{LS}}$ is the $i$th element of $\hat\theta^{\mathrm{LS}}$.

To derive the optimal solution $\hat\theta^{\mathrm{R}}$, we first recall the definition of subderivative and subdifferential of a convex function $f : X \to \mathbb{R}$ with $X$ being an open interval. The subderivative of a convex function $f : X \to \mathbb{R}$ at a point $x_0$ in the open interval $X$ is a real number $a$ such that

$$f(x) - f(x_0) \ge a(x - x_0)$$

for all $x$ in $X$. It can be shown that there exist $b$ and $c$ with $b \le c$ such that the set of subderivatives at $x_0$ for a convex function is a nonempty closed interval $[b, c]$, where $b$ and $c$ are the one-sided limits defined as follows:

$$b = \lim_{x \to x_0^-} \frac{f(x) - f(x_0)}{x - x_0}, \quad c = \lim_{x \to x_0^+} \frac{f(x) - f(x_0)}{x - x_0}.$$

The closed interval $[b, c]$ is called the subdifferential of $f(x)$ at the point $x_0$.
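Then, considering (3.109), $\hat\theta^{\mathrm{R}}$ is an optimal solution if

$$0 \in -2(\hat\theta_i^{\mathrm{LS}} - \hat\theta_i^{\mathrm{R}}) + \gamma\, \partial|\hat\theta_i^{\mathrm{R}}|, \quad i = 1, 2, \ldots, n, \tag{3.110}$$

where $\hat\theta_i^{\mathrm{R}}$ is the $i$th element of $\hat\theta^{\mathrm{R}}$ and $\partial|\hat\theta_i^{\mathrm{R}}|$ represents the subdifferential of $|\hat\theta_i^{\mathrm{R}}|$, which is equal to

$$\partial|\hat\theta_i^{\mathrm{R}}| = \begin{cases} \{\mathrm{sign}(\hat\theta_i^{\mathrm{R}})\}, & \hat\theta_i^{\mathrm{R}} \ne 0, \\ [-1, 1], & \hat\theta_i^{\mathrm{R}} = 0, \end{cases} \quad i = 1, 2, \ldots, n. \tag{3.111}$$

Using (3.110) and (3.111), we obtain the following explicit solution of LASSO for orthogonal $\Phi$:

$$\hat\theta_i^{\mathrm{R}} = \mathrm{sign}(\hat\theta_i^{\mathrm{LS}})\, \max\Big\{0,\ |\hat\theta_i^{\mathrm{LS}}| - \frac{\gamma}{2}\Big\}, \quad i = 1, 2, \ldots, n. \tag{3.112}$$

From (3.112) one can see that the solution of LASSO will be sparse if many absolute values of the elements of $\hat\theta^{\mathrm{LS}}$ are smaller than $\gamma/2$. So, $\gamma$ can be used to tune the sparsity of $\theta$. It can also be seen that the nonzero elements of the solution of LASSO are biased and that, compared with the LS solution, they are shrunk towards zero (translated towards zero by the constant amount $\gamma/2$).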
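The soft-thresholding rule (3.112) takes one line of code. The sketch below applies it to an arbitrary vector playing the role of the LS estimate; the numerical values are chosen only for illustration and the function name is hypothetical.

```python
import numpy as np

def soft_threshold(theta_ls, gamma):
    """Closed-form LASSO solution (3.112) for an orthogonal regression matrix."""
    theta_ls = np.asarray(theta_ls, dtype=float)
    return np.sign(theta_ls) * np.maximum(0.0, np.abs(theta_ls) - gamma / 2.0)

theta_ls = np.array([1.5, -0.2, 0.05, -2.0])   # illustrative LS estimate
theta_r = soft_threshold(theta_ls, gamma=1.0)
# Entries with |theta_ls| <= gamma/2 = 0.5 become zero,
# the remaining ones are moved towards zero by 0.5.
```

Here the two small entries are set exactly to zero, while the two large ones are shifted to 1.0 and -1.5, which illustrates both the sparsity and the bias discussed above.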


3.6.1.3 LASSO Using a Generic Regression Matrix: Geometric 
Interpretation 


For a generic non-orthogonal $\Phi$, LASSO in general has no explicit solution. To understand why it can still induce sparse solutions, we can use the geometric interpretation of LASSO in the form of (3.106) with $\theta \in \mathbb{R}^2$. In Fig. 3.12, one can see that for the first case coloured in blue (resp., the third case coloured in brown), if the elliptical contour is rotated slightly about the axis perpendicular to the paper and through the blue (resp., brown) cross, the optimal solution of (3.106) will still have a zero $\theta_1$-element (resp., $\theta_2$-element). This explains why LASSO can often induce sparse solutions with a suitable choice of the regularization parameter.

Fig. 3.12 Geometric interpretation of the solution of LASSO in the form (3.106) with non-orthogonal $\Phi$ and $\theta = [\theta_1\ \theta_2]^T \in \mathbb{R}^2$. First, the large grey square represents the constraint $\|\theta\|_1 \le \beta$. Then, three cases are considered here and coloured in blue, red and brown, respectively. For each case, the tiny square represents the least squares estimate $\hat\theta^{\mathrm{LS}}$, the elliptical contours represent the level curves of $\|Y - \Phi\theta\|_2^2$ centred at $\hat\theta^{\mathrm{LS}}$ and the cross represents the solution of LASSO (3.106). For the first case coloured in blue, the cross happens at the top corner of the large grey square and implies that the $\theta_1$-element of the solution of LASSO (3.106) is zero. For the second case coloured in red, the cross and the tiny square coincide and imply that the least squares estimate $\hat\theta^{\mathrm{LS}}$ is also the solution of LASSO (3.106), whose two components are both nonzero. For the third case coloured in brown, the cross happens at the right corner of the large grey square and implies that the $\theta_2$-element of the solution of LASSO (3.106) is zero

Finally, since the cost function of LASSO (3.105) is a convex function of $\theta$, many standard convex optimization software packages are available to obtain numerical solutions of LASSO very efficiently, such as YALMIP [26], CVX [19], CVXOPT [3], CVXPY [11].
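For readers who prefer to see the convex problem solved without an external package, the following is a minimal sketch based on proximal gradient iterations (often called ISTA). This is a technique different from the solvers cited above and is not taken from the text; the step size, the number of iterations and the simulated data are assumptions, and the function names are hypothetical.

```python
import numpy as np

def lasso_objective(Phi, Y, theta, gamma):
    """Cost in (3.105): ||Y - Phi theta||_2^2 + gamma * ||theta||_1."""
    return np.sum((Y - Phi @ theta) ** 2) + gamma * np.sum(np.abs(theta))

def lasso_ista(Phi, Y, gamma, n_iter=500):
    """Proximal gradient iterations for (3.105), started from theta = 0."""
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)     # 1 / Lipschitz constant of the gradient
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ theta - Y)           # gradient of the quadratic term
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(0.0, np.abs(z) - step * gamma)  # soft thresholding
    return theta

# Simulated sparse problem with n > N (assumed for illustration only)
rng = np.random.default_rng(4)
N, n, gamma = 20, 40, 2.0
Phi = rng.standard_normal((N, n))
theta0 = np.zeros(n)
theta0[[3, 17, 30]] = [2.0, -1.5, 1.0]
Y = Phi @ theta0 + 0.1 * rng.standard_normal(N)
theta_hat = lasso_ista(Phi, Y, gamma)
```

For an orthogonal $\Phi$ the step size equals $1/2$ and a single iteration already returns the soft-thresholded LS estimate (3.112).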


Example 3.7 (Polynomial regression-LASSO) We revisit the polynomial regression Examples (3.26) and (3.27) with LASSO (3.105). In particular, we set the model order to $n = 16$, with the regression matrix $\Phi$ built according to (3.4) and (3.12). Moreover, we let $\gamma = \gamma_i$, $i = 1, \ldots, 16$, with $\gamma_1 = 0.01$, $\gamma_{16} = 0.31$ and $\gamma_2, \ldots, \gamma_{15}$ evenly spaced between $\gamma_1$ and $\gamma_{16}$. For each $\gamma = \gamma_i$, we compute the corresponding solution of the LASSO (3.105). In particular, the estimates $\hat g(x) = \phi(x)^T\hat\theta^{\mathrm{R}}$ for $x = x_i$, with $i = 1, \ldots, 40$, are plotted in Fig. 3.13.

Fig. 3.13 Polynomial regression: true function $g(x)$ (blue) and LASSO estimates (thin) for different values of the regularization parameter $\gamma$


The model fits (3.28) obtained for different $\gamma$ are shown in Fig. 3.14. One can see that $\gamma = 0.15$ gives the best result.

Finally, the LASSO estimates of the components of $\theta$ obtained using the different values of $\gamma$ are shown in Fig. 3.15. It is evident that the LASSO estimate (3.105) is sparse. Comparing it with the ridge regression estimates reported in Fig. 3.11, one can conclude that LASSO may give a simpler model, i.e., depending only on a limited number of components of $\theta$.


3.6.1.4 Sparsity Inducing Regularizers Beyond the $\ell_1$-Norm


We have seen that the $\ell_1$-norm plays a key role for sparse estimation. However, as shown in [34], there are many other sparsity inducing regularizers. Let $\ell$ be any concave and nondecreasing function on $[0, \infty)$, three examples being reported in the top panel of Fig. 3.16. Then, other penalties which promote sparsity assume the form $J(\theta) = \sum_{i=1}^{n} \ell(\theta_i^2)$ and are given by

$$\begin{aligned}
\ell_p:&\quad J(\theta) = \sum_{i=1}^{n} |\theta_i|^p, \qquad \ell(\eta) = \eta^{p/2}, \quad p \in (0, 2), \\
\log:&\quad J(\theta) = \sum_{i=1}^{n} \log(|\theta_i| + \epsilon), \qquad \ell(\eta) = \log(|\eta|^{1/2} + \epsilon), \quad \epsilon > 0.
\end{aligned} \tag{3.113}$$

Some of them are displayed in the bottom panel of Fig. 3.16. The use of nonconvex penalties may increase the sparsity of the solution, but the drawback is that optimization problems possibly exposed to local minima must be handled.

Fig. 3.14 Polynomial regression: profile of the model fit (3.28) obtained by LASSO as a function of the regularization parameter $\gamma$

Fig. 3.15 Polynomial regression: profile of the estimates of each component forming the LASSO estimate (3.105). For each value $k = 0, \ldots, 15$ on the x-axis the plot reports the estimates of the coefficient of the monomial $x^k$ obtained by using different values of the regularization parameter $\gamma$

Fig. 3.16 The top panel shows profiles of $\ell(\theta_i)$ given by $\theta_i^{0.05}$, $\log(\theta_i^{0.5} + 1)$ and $\theta_i^{0.5}$ with $\theta_i$ ranging over $[0, 1]$. The bottom panel displays profiles of sparsity inducing penalties $\ell(\theta_i^2)$ given by $|\theta_i|^{0.1}$, $\log(|\theta_i| + 1)$ and $|\theta_i|$ with $\theta_i$ ranging over $[-1, 1]$


3.6.1.5 Presence of Outliers and Robust Regression 


In practical applications, it may happen that the measured outputs $y_i$, so far described by the model

$$y_i = \phi_i^T\theta_0 + e_i, \quad i = 1, \ldots, N,$$

are contaminated by outliers, which represent unexpected deviations from the noise model. They can be due to the failure of some sensors or to mistakes in the setting of the experiment. In this case, data can actually be generated by the following system:

$$y_i = \phi_i^T\theta_0 + e_i + v_{0,i}, \quad i = 1, \ldots, N, \tag{3.114}$$

where the $e_i$ form a white noise with mean zero and variance $\sigma^2$, while the $v_{0,i}$ represent the outliers, which are assumed to be zero most of the time. Hence, the vector

$$V_0 = [\, v_{0,1} \ v_{0,2} \ \cdots \ v_{0,N} \,]^T$$

is assumed to be sparse.

When data come from (3.114), straightforward application of the LS method may lead to a poor estimate $\hat\theta^{\mathrm{LS}}$ of $\theta_0$. For illustration, let us consider an extreme case by assuming $v_{0,i} = 0$ for $i = 1, 2, \ldots, N - 1$ while the $|\phi_i^T\theta_0 + e_i|$ for $i = 1, \ldots, N$ are all negligible compared to $|v_{0,N}|$. LS leads to

$$\begin{aligned}
\hat\theta^{\mathrm{LS}} &= \arg\min_{\theta} \sum_{i=1}^{N} (y_i - \phi_i^T\theta)^2 \\
&= \arg\min_{\theta} \sum_{i=1}^{N-1} (\phi_i^T\theta_0 + e_i - \phi_i^T\theta)^2 + (\phi_N^T\theta_0 + e_N + v_{0,N} - \phi_N^T\theta)^2.
\end{aligned}$$

The first $N - 1$ terms in the above cost function are the same as those encountered in the absence of outliers, while the last term is different due to $v_{0,N}$. The $|\phi_i^T\theta_0 + e_i|$, $i = 1, \ldots, N$, are negligible compared to $|v_{0,N}|$, a phenomenon then further amplified by the quadratic criterion adopted here. To make the last term as small as possible, $\hat\theta^{\mathrm{LS}}$ will mainly tend to fit only $v_{0,N}$. Hence, the terms $\phi_i^T\theta_0 + e_i$, which carry information on the true system, will be given little weight. This will lead to a poor estimate of $\theta_0$.

Many robust regression methods are available, hinging on loss functions less sensitive to outliers than the square loss. An example is Huber estimation

$$\hat\theta^{\mathrm{Huber}} = \arg\min_{\theta} \sum_{i=1}^{N} \ell_\gamma^{\mathrm{Huber}}(y_i - \phi_i^T\theta), \tag{3.115}$$

where the Huber loss function $\ell_\gamma^{\mathrm{Huber}}$ is defined as follows:

$$\ell_\gamma^{\mathrm{Huber}}(x) = \begin{cases} x^2, & |x| \le \frac{\gamma}{2}, \\ \gamma|x| - \frac{1}{4}\gamma^2, & |x| > \frac{\gamma}{2}. \end{cases} \tag{3.116}$$

In (3.116), the parameter $\gamma > 0$ is a tuning parameter whose role will become clear shortly. The Huber loss function (3.116) is less sensitive to outliers because it grows linearly for $|x| > \gamma/2$. Note that a limit case of the Huber loss is the $\ell_1$-norm, obtained with $\gamma$ which tends to zero.
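The loss (3.116) can be coded directly. The short sketch below (function name and numerical values are only illustrative) shows how a large residual is penalized much less than under the square loss.

```python
import numpy as np

def huber_loss(x, gamma):
    """Huber loss (3.116): quadratic for |x| <= gamma/2, linear beyond."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= gamma / 2.0,
                    x ** 2,
                    gamma * np.abs(x) - gamma ** 2 / 4.0)

gamma = 1.0
small_residual = huber_loss(0.3, gamma)     # quadratic regime: 0.3^2
large_residual = huber_loss(10.0, gamma)    # linear regime: 10 - 0.25, versus 100 for the square loss
```

The two branches match at $|x| = \gamma/2$, where both equal $\gamma^2/4$, so the loss is continuous.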


3.6.1.6 An Equivalence Between $\ell_1$-Norm Regularization and Huber Estimation


Consider the $\ell_1$-norm regularization given by

$$\arg\min_{\theta, V_0} \sum_{i=1}^{N} (y_i - \phi_i^T\theta - v_{0,i})^2 + \gamma|v_{0,i}|, \tag{3.117}$$

whose peculiarity is to require joint optimization w.r.t. the parameter vector $\theta$ and the outliers $v_{0,i}$ contained in $V_0$. Interestingly, (3.117) is actually equivalent to Huber estimation (3.115), i.e., they have the same optimal solution. To show this, one needs just to prove that

$$\sum_{i=1}^{N} \ell_\gamma^{\mathrm{Huber}}(y_i - \phi_i^T\theta) = \min_{V_0} \|Y - \Phi\theta - V_0\|_2^2 + \gamma\|V_0\|_1. \tag{3.118}$$

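The right-hand side of (3.118) corresponds to LASSO (3.105) with an orthogonal regression matrix given by the identity and with output vector $Y - \Phi\theta$. It thus follows from (3.112) that the components of the optimal solution $\hat V_0^{\mathrm{R}}$ admit the following closed-form expression:

$$\hat v_{0,i}^{\mathrm{R}} = \mathrm{sign}(y_i - \phi_i^T\theta)\, \max\Big\{0,\ |y_i - \phi_i^T\theta| - \frac{\gamma}{2}\Big\}, \quad i = 1, 2, \ldots, N. \tag{3.119}$$

Now we replace $V_0$ in the cost function on the right-hand side of (3.118) with $\hat V_0^{\mathrm{R}}$ and it is straightforward to check that the following identity holds:

$$\sum_{i=1}^{N} \ell_\gamma^{\mathrm{Huber}}(y_i - \phi_i^T\theta) = \|Y - \Phi\theta - \hat V_0^{\mathrm{R}}\|_2^2 + \gamma\|\hat V_0^{\mathrm{R}}\|_1. \tag{3.120}$$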
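The identity (3.120) can also be checked numerically, one residual at a time. In the sketch below, `x` plays the role of a residual $y_i - \phi_i^T\theta$; the function names and the grid of test values are illustrative assumptions.

```python
import numpy as np

def huber_loss(x, gamma):
    """Huber loss (3.116)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= gamma / 2.0, x ** 2, gamma * np.abs(x) - gamma ** 2 / 4.0)

def min_over_outlier(x, gamma):
    """Value of min_v (x - v)^2 + gamma * |v|, attained at the soft-thresholded v in (3.119)."""
    x = np.asarray(x, dtype=float)
    v = np.sign(x) * np.maximum(0.0, np.abs(x) - gamma / 2.0)
    return (x - v) ** 2 + gamma * np.abs(v)

gamma = 1.0
x = np.linspace(-3.0, 3.0, 601)               # grid of residual values
gap = np.max(np.abs(min_over_outlier(x, gamma) - huber_loss(x, gamma)))
```

The maximum gap is zero up to rounding: for $|x| \le \gamma/2$ the optimal outlier is zero and the cost is $x^2$, otherwise the cost is $\gamma^2/4 + \gamma(|x| - \gamma/2) = \gamma|x| - \gamma^2/4$.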


Therefore, (3.117) is indeed equivalent to the Huber estimation (3.115).


3.6.2 Nuclear Norm Regularization


So far the output $Y$, the parameter $\theta$ and the noise $E$ in (3.13) have been assumed to be vectors. In what follows, we allow them to be matrices and consider the following linear regression model:

$$Y = \Phi\theta_0 + E, \quad Y \in \mathbb{R}^{N \times m}, \ \Phi \in \mathbb{R}^{N \times n}, \ \theta_0 \in \mathbb{R}^{n \times m}, \ E \in \mathbb{R}^{N \times m}. \tag{3.121}$$

The ReLS with nuclear norm regularization takes the following form:

$$\hat\theta^{\mathrm{R}} = \arg\min_{\theta} \|Y - \Phi\theta\|_F^2 + \gamma\|h(\theta)\|_*, \tag{3.122}$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix, $h(\theta)$ is a matrix that is affine in $\theta$ and $\|h(\theta)\|_*$ is the nuclear norm of the matrix $h(\theta)$; see also Sect. 3.8.1, in the appendix to this chapter, for a brief review of matrix and vector norms.


3.6.2.1 Nuclear Norm Regularization for Matrix Rank Minimization 


Matrix rank minimization problems (RMP) are a class of optimization problems that 
involve minimizing the rank of a matrix subject to convex constraints. They are often 
encountered in signal processing, image processing and statistics. For example, a 
typical statistical problem is to obtain a low-rank covariance matrix able to describe 
some available data and/or consistent with some prior assumptions. Formally, the 
RMP is defined as follows: 


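$$\text{RMP:}\quad \begin{aligned} &\min_{X}\ \operatorname{rank}(X) \\ &\text{subj. to } X \in \mathcal{C} \subset \mathbb{R}^{n\times m}, \end{aligned} \tag{3.123}$$

with $X$ belonging to a convex set $\mathcal{C}$ while $\operatorname{rank}(X)$ describes the order (complexity) of the underlying model.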

In general, the RMP (3.123) is NP-hard and thus approximate methods are needed. Several heuristic methods have been proposed, such as the nuclear norm heuristic [14] and the log-det heuristic [15]. In particular, for a convex set $\mathcal{C}$ the convex envelope of a function $f : \mathcal{C} \to \mathbb{R}$ is defined as the largest convex function $g$ such that $g(x) \le f(x)$ for every $x \in \mathcal{C}$, e.g., [22]. For a nonconvex $f$, solving

$$\min_{x \in \mathcal{C}} f(x) \tag{3.124}$$

may be difficult. In this case, if it is possible to derive the convex envelope $g$ of $f$, then

$$\min_{x \in \mathcal{C}} g(x) \tag{3.125}$$

turns out to be a convex approximation of (3.124) and, in particular, the minimum of (3.125) is a lower bound on that of (3.124). Moreover, if necessary, the minimizing argument of (3.125) can be chosen as the initial point for a more complicated nonconvex local search aiming to solve (3.124).

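As shown in Theorem 1 of [13, Chap. 5], the convex envelope of the rank function $\operatorname{rank}(X)$ with $X \in \mathcal{C} = \{X \mid \|X\|_2 \le 1,\ X \in \mathbb{R}^{n\times m}\}$ is the nuclear norm of $X$, i.e., $\|X\|_*$. As a result, the nuclear norm heuristic to solve the RMP (3.123) is obtained by replacing the rank of $X$ with the nuclear norm of $X$, i.e.,

$$\text{Nuclear norm heuristic:}\quad \begin{aligned} &\min_{X}\ \|X\|_* \\ &\text{subj. to } X \in \mathcal{C} \subset \mathbb{R}^{n\times m}. \end{aligned} \tag{3.126}$$

Without loss of generality, we assume that $X \in \mathcal{C} = \{X \mid \|X\|_2 \le M,\ X \in \mathbb{R}^{n\times m}\}$ for some $M > 0$. Then, from the definition of the convex envelope, for $X \in \mathcal{C}$ we have

$$\left\|\frac{X}{M}\right\|_* \le \operatorname{rank}\left(\frac{X}{M}\right) \implies \frac{1}{M}\|X\|_* \le \operatorname{rank}(X).$$

In addition,

$$\frac{1}{M}\|X^{\mathrm{copt}}\|_* \le \operatorname{rank}(X^{\mathrm{opt}}) \le \operatorname{rank}(X^{\mathrm{copt}}), \tag{3.127}$$

where $X^{\mathrm{opt}}$ and $X^{\mathrm{copt}}$ denote the optimal solution of the RMP (3.123) and that of the nuclear norm heuristic (3.126), respectively. The inequalities in (3.127) thus provide an upper and a lower bound on the optimal value of the RMP (3.123), i.e., on the minimum rank.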
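
The inequality $\frac{1}{M}\|X\|_* \le \operatorname{rank}(X)$ used above is easy to verify numerically, since both norms are functions of the singular values. The sketch below is only an illustration under stated assumptions: the matrix sizes, the random seed and the choice of $M$ as the spectral norm of $X$ itself (the smallest admissible value) are ours, not the book's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 6 x 4 matrix of rank 2, built as a product of thin factors
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

s = np.linalg.svd(X, compute_uv=False)  # singular values, in decreasing order
M = s[0]                                # spectral norm ||X||_2, so that ||X||_2 <= M holds
nuclear = s.sum()                       # nuclear norm ||X||_*
rank = np.linalg.matrix_rank(X)

lower_bound = nuclear / M               # (1/M) ||X||_*, a lower bound on rank(X)
```

With this choice of $M$ the bound reads $\sum_i \sigma_i / \sigma_1 \le \operatorname{rank}(X)$, which holds because each ratio $\sigma_i/\sigma_1$ is at most one and only $\operatorname{rank}(X)$ of them are nonzero.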

As shown in [13, Chap. 5], the nuclear norm heuristic (3.126) can be equivalently formulated as a semidefinite program (SDP):

$$\begin{aligned} &\min_{X, Y, Z}\ \operatorname{trace} Y + \operatorname{trace} Z \\ &\text{subj. to } \begin{bmatrix} Y & X \\ X^T & Z \end{bmatrix} \succeq 0, \quad X \in \mathcal{C}, \end{aligned} \tag{3.128}$$

where $Y \in \mathbb{R}^{n\times n}$, $Z \in \mathbb{R}^{m\times m}$ and both $Y$ and $Z$ are symmetric. The SDP problem (3.128) can be solved by interior point methods. Convex optimization software packages that can be used for this purpose include YALMIP [26], CVX [19], CVXOPT [3] and CVXPY [11].
3.6.2.2 Application in Covariance Matrix Estimation with Low-Rank Structure


Now we go back to the linear regression model (3.121) and the ReLS with nuclear norm regularization (3.122). Consider the problem of covariance matrix estimation with low-rank structure, e.g., [38]. In particular, in (3.121), we take $N = m = n$, let $Y$ be a sample covariance matrix, $\Phi = I_N$, and $\theta$ be a positive semidefinite matrix which has low-rank structure. Moreover, in (3.122), we take $h(\theta) = \theta$. We can then obtain a matrix estimate $\hat\theta^R$ with low-rank structure using ReLS with nuclear norm regularization as follows:

$$\hat\theta^R = \arg\min_{\theta} \|Y - \theta\|_F^2 + \gamma \|\theta\|_*, \tag{3.129}$$

for a suitable choice of $\gamma > 0$. An example is reported below.


Example 3.8 (Covariance matrix estimation problem) First, we construct a block-diagonal rank-deficient covariance matrix $\theta_0$ that has 4 blocks denoted by $A_i \in \mathbb{R}^{n_i \times n_i}$ with $n_1 = 20$, $n_2 = 10$, $n_3 = 5$ and $n_4 = 15$. Using blkdiag to represent a block-diagonal matrix, one thus has $\theta_0 = \mathrm{blkdiag}(A_1, A_2, A_3, A_4)$. Each $A_i$ is generated by summing up $v_{i,j} v_{i,j}^T$, $j = 1, \ldots, n_i - 2$, where the $v_{i,j}$ are $n_i$-dimensional vectors with components independent and uniformly distributed on $[-1, 1]$. It follows that $\operatorname{rank}(\theta_0) = 42$ since the rank of the $i$th block is $n_i - 2$. Then we draw 20000 samples $x_i$ from the Gaussian distribution $\mathcal{N}(0, \theta_0)$. The available measurements are $z_i = x_i + e_i$ where the $e_i$ are independent and distributed as $\mathcal{N}(0, 0.6)$. Using the $z_i$ we calculate the sample covariance $Y$ as follows:

$$Y = \frac{1}{20000} \sum_{i=1}^{20000} (z_i - \bar z)(z_i - \bar z)^T, \qquad \bar z = \frac{1}{20000} \sum_{i=1}^{20000} z_i. \tag{3.130}$$

We solve the ReLS problem (3.129) with the data $Y$ defined above and $\gamma$ in the set {0.1411, 0.1414, 0.1419, 0.1423, 0.1427}, obtaining different estimates $\hat\theta^R$ of the covariance matrix.

The top panel of Fig. 3.17 shows the base 10 logarithm of the 50 estimated singular values. Each profile is obtained with a different regularization parameter. These results show that, treating the tiny singular values as null, a suitable value of the regularization parameter, like $\gamma = 0.1427$, leads to $\operatorname{rank}(\hat\theta^R) = 42$. Note in fact that the green curve, which is associated with this $\gamma$, has a jump towards zero when passing from 42 to 43 on the x-axis. The influence of the nuclear norm regularization is also visible in the bottom panel, which shows the profile of the relative error of $\hat\theta^R$ as a function of $\gamma$. When $\gamma$ is small, e.g., $\gamma = 0.1411$, the influence is invisible: $\hat\theta^R$ is almost the same as the sample covariance $Y$ and $\operatorname{rank}(\hat\theta^R) = 50$. When $\gamma$ becomes larger, the influence of the regularization becomes more visible, making $\hat\theta^R$ closer to the true covariance $\theta_0$.
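
The text does not say how (3.129) was solved in the example. For this particular problem (identity regression matrix and $h(\theta) = \theta$), a standard fact is that the minimizer is obtained by soft-thresholding the singular values of $Y$ at $\gamma/2$, in analogy with (3.119). The sketch below illustrates this on a small synthetic matrix; the function name, the $5 \times 5$ size and the chosen spectrum are hypothetical and unrelated to the data of Example 3.8.

```python
import numpy as np


def nuclear_norm_denoise(Y, gamma):
    """Minimizer of ||Y - theta||_F^2 + gamma * ||theta||_* obtained by
    soft-thresholding the singular values of Y at gamma / 2."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - gamma / 2.0, 0.0)
    return (U * s_shrunk) @ Vt


rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthogonal matrix
eigs = np.array([3.0, 2.0, 1.0, 0.05, 0.02])      # three dominant and two tiny eigenvalues
Y = (Q * eigs) @ Q.T                              # symmetric PSD stand-in for a sample covariance

theta_hat = nuclear_norm_denoise(Y, gamma=0.2)
sv = np.linalg.svd(theta_hat, compute_uv=False)
est_rank = int(np.sum(sv > 1e-10))                # tiny singular values are treated as null
```

Singular values below $\gamma/2$ are set exactly to zero, which is the mechanism behind the jump in the singular value profile described above; the remaining ones are shrunk by $\gamma/2$, which is the source of the bias introduced by the regularizer.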


Fig. 3.17 Covariance estimation with low-rank structure. Panel a shows the base 10 logarithm of the 50 singular values of the estimated covariance matrix $\hat\theta^R$ with different values of $\gamma$. Panel b shows the profile of the relative error of the estimated covariance matrix
3.6.2.3 Vector Case: $\ell_1$-Norm Regularization


The nuclear norm heuristic and inequalities (3.127) also justify the use of the $\ell_1$-norm regularization (3.108) for the problem of finding sparse solutions (3.107).

For the vector case, i.e., $\theta \in \mathbb{R}^{n\times m}$ with $m = 1$, we can take $X$ and $\mathcal{C}$ in the previous section to be $X = \theta$ and $\mathcal{C} = \{\theta \in \mathbb{R}^n \mid \|Y - \Phi\theta\|_2^2 \le \varepsilon\}$. Then it is easy to see that the $\ell_1$-norm is the convex envelope of the $\ell_0$-norm for $\|\theta\|_\infty \le 1$, i.e.,

$$\|\theta\|_1 \le \|\theta\|_0, \quad \text{for } \|\theta\|_\infty \le 1.$$

Then, the RMP (3.123) and the nuclear norm heuristic (3.126) become the problem of finding sparse solutions (3.107) and the $\ell_1$-norm regularization (3.108), respectively. Similar to what is done to obtain (3.127), we assume that $\|\theta\|_\infty \le M$ for some $M > 0$. If $\|\theta\|_\infty \le M$, one has

$$\frac{1}{M}\|\theta^{\mathrm{copt}}\|_1 \le \|\theta^{\mathrm{opt}}\|_0 \le \|\theta^{\mathrm{copt}}\|_0, \tag{3.131}$$

where $\theta^{\mathrm{opt}}$ and $\theta^{\mathrm{copt}}$ denote the optimal solution of the problem of finding sparse solutions (3.107) and that of the $\ell_1$-norm regularization (3.108), respectively. Similar to the matrix case, (3.131) provides an upper and a lower bound on the optimal value of the sparse estimation problem (3.107).


3.7 Further Topics and Advanced Reading 


A systematic treatment of regression theory is available in many textbooks, e.g.,
[12, 35]. Noise variance estimation is a critical issue in practical applications and
has been discussed in detail in [48]. When the regression matrix is ill-conditioned,
it is important to make sure that the least squares estimate is calculated in an accu- 
rate and efficient way, e.g., [10, 17]. Moreover, for the regularized least squares in 
quadratic form, the regularization matrix could also be ill-conditioned. In this case, 
extra care is required in the calculation of both the regularized least squares esti- 
mate and the hyperparameter estimates, e.g., [8]. For given data, the quality of a 
model depends on the control of its complexity, which can be described by different 
measures in different contexts, e.g., the model order and the equivalent degrees of 
freedom. A good exposition of model complexity and its selection can be found in 
[21]. It is worth mentioning that the degrees of freedom for LASSO have also been
defined and discussed in [43, 51]. In practical applications, there are two key issues 
for the regularized least squares with quadratic regularization: the design of the reg- 
ularization matrix and the estimation of the hyperparameter. While the latter issue 
has been discussed extensively in the literature, e.g., [21, 36, 46, 47], there are far
fewer results on the former issue in the context of system identification, as discussed
in [7]. The asymptotic properties of some widely used hyperparameter estimators, 
such as the maximum marginal likelihood estimator, Stein’s unbiased risk estimator, 
generalized cross-validation, etc., have been reported in [29, 30]. LASSO and its 
variants have been extremely popular in practical applications, as described in [16, 
28, 32, 50]. The nuclear norm heuristic to solve matrix rank minimization problems
has found wide practical application, see, e.g., [5, 6, 14, 15, 37]. Beyond
the Huber loss function [23], the square loss function can be replaced also by other 
convex functions like the Vapnik loss function [45] as discussed later on in Chap. 6. 


3.8 Appendix 


3.8.1 Fundamentals of Linear Algebra 


In this section, we review some fundamentals of linear algebra used in this chapter. 


3.8.1.1 QR Factorization and Singular Value Decomposition 


We begin by giving the definitions of the QR factorization and the SVD, which are very important decompositions used for many purposes other than solving LS problems. For any $\Phi \in \mathbb{R}^{N\times n}$ with $N \ge n$, $\Phi$ can be decomposed as follows:

$$\Phi = QR, \tag{3.132}$$

where $Q \in \mathbb{R}^{N\times N}$ is orthogonal, i.e., $Q^TQ = QQ^T = I_N$, and $R \in \mathbb{R}^{N\times n}$ is upper triangular. Further assume that $\Phi$ has full rank. Then $\Phi$ can be decomposed as follows:

$$\Phi = Q_1 R_1, \tag{3.133}$$

where $Q_1 = Q(:, 1:n)$ and $R_1 = R(1:n, 1:n)$, with $Q(:, 1:n)$ being the matrix consisting of the first $n$ columns of $Q$ and $R(1:n, 1:n)$ being the matrix consisting of the first $n$ rows and $n$ columns of $R$. The factorizations (3.132) and (3.133) are called the full and thin QR factorization, respectively. In particular, when $R_1$ is required to have positive diagonal entries, the thin QR factorization (3.133) is unique.

We start by providing the “economy size” definition of the SVD. For any $\Phi \in \mathbb{R}^{N\times n}$ with $N \ge n$, $\Phi$ can be decomposed as follows:

$$\Phi = U \Lambda V^T, \tag{3.134}$$

where $U \in \mathbb{R}^{N\times n}$ satisfies $U^TU = I_n$, $\Lambda = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$, and $V \in \mathbb{R}^{n\times n}$ is orthogonal. The factorization (3.134) is called the singular value decomposition (SVD) of $\Phi$ and the $\sigma_i$, $i = 1, \ldots, n$, are called the singular values of $\Phi$.

The SVD also admits the “full size” formulation, as given in (3.29). One has that (3.134) still holds but $U$ is an orthogonal $N \times N$ matrix and $\Lambda$ is a rectangular $N \times n$ diagonal matrix, while $V$ is still an orthogonal $n \times n$ matrix. In this second formulation, $V$ and $U$ can be associated with orthonormal changes of coordinates in the domain and codomain of $\Phi$ such that, in the new coordinates, the linear operator is diagonal.
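
As an illustration of the four factorizations just described, the sketch below computes them with NumPy for a small random matrix. The sizes and the seed are arbitrary assumptions. Note that `numpy.linalg.qr` does not enforce positive diagonal entries in the triangular factor, so the thin factors it returns may differ by signs from the unique normalized factorization mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 6, 3
Phi = rng.standard_normal((N, n))  # full column rank with probability one

# Full QR (3.132): Q is N x N orthogonal, R is N x n upper triangular
Q, R = np.linalg.qr(Phi, mode="complete")

# Thin QR (3.133): Q1 is N x n with orthonormal columns, R1 is n x n
Q1, R1 = np.linalg.qr(Phi, mode="reduced")

# Economy-size SVD (3.134): U is N x n, s holds sigma_1 >= ... >= sigma_n
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)

# Full-size SVD: Uf is N x N orthogonal, the singular values are unchanged
Uf, sf, Vtf = np.linalg.svd(Phi, full_matrices=True)
```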


3.8.1.2 Vector and Matrix Norms 


Important vector norms are the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms. For a given vector $\theta \in \mathbb{R}^n$, they are denoted by $\|\theta\|_1$, $\|\theta\|_2$ and $\|\theta\|_\infty$, respectively, and are defined as follows:

$$\|\theta\|_1 = \sum_{i=1}^{n} |\theta_i|, \tag{3.135}$$

$$\|\theta\|_2 = \sqrt{\sum_{i=1}^{n} \theta_i^2}, \tag{3.136}$$

$$\|\theta\|_\infty = \max\{|\theta_1|, |\theta_2|, \ldots, |\theta_n|\}, \tag{3.137}$$

where the $\ell_2$ norm is also known as the Euclidean norm.

Important matrix norms are the nuclear norm, the Frobenius norm and the spectral norm. For a given matrix $\Phi \in \mathbb{R}^{N\times n}$ with $N \ge n$, these three matrix norms are denoted by $\|\Phi\|_*$, $\|\Phi\|_F$ and $\|\Phi\|_2$, respectively, and are defined as follows:

$$\|\Phi\|_* = \sum_{i=1}^{n} \sigma_i(\Phi), \tag{3.138}$$

$$\|\Phi\|_F = \sqrt{\sum_{i=1}^{N}\sum_{j=1}^{n} \Phi_{i,j}^2} = \sqrt{\sum_{i=1}^{n} \sigma_i^2(\Phi)}, \tag{3.139}$$

$$\|\Phi\|_2 = \sigma_{\max}(\Phi), \tag{3.140}$$

where $\sigma_i(\Phi)$ represents the $i$th largest singular value of $\Phi$, $\sigma_{\max}(\Phi) = \sigma_1(\Phi)$ and $\Phi_{i,j}$ is the $(i, j)$th element of $\Phi$.

Now, we report some properties of the vector and matrix norms. The $i$th largest singular value of $\Phi$ is equal to the square root of the $i$th largest eigenvalue of $\Phi^T\Phi$, or equivalently of $\Phi\Phi^T$. If $\Phi$ is square and positive semidefinite, then the nuclear norm of $\Phi$ is equal to the trace of $\Phi$, i.e., $\|\Phi\|_* = \operatorname{trace}(\Phi)$. For matrices $A, B \in \mathbb{R}^{N\times n}$ we can define the inner product on $\mathbb{R}^{N\times n} \times \mathbb{R}^{N\times n}$ as $\langle A, B\rangle = \operatorname{trace}(A^TB) = \sum_{i=1}^{N}\sum_{j=1}^{n} A_{i,j}B_{i,j}$. So the Frobenius norm is the norm associated with this inner product. The spectral norm is defined as the induced 2-norm, i.e., for $\Phi \in \mathbb{R}^{N\times n}$,

$$\|\Phi\|_2 = \max_{\theta \ne 0} \frac{\|\Phi\theta\|_2}{\|\theta\|_2} = \max_{\|\theta\|_2 = 1} \|\Phi\theta\|_2. \tag{3.141}$$

To show that (3.141) is equal to (3.140), note that $\max_{\|\theta\|_2 = 1} \|\Phi\theta\|_2$ is equivalent to $\max_{\|\theta\|_2 = 1} \|\Phi\theta\|_2^2$, which is further equivalent to

$$\max_{\theta}\ \|\Phi\theta\|_2^2 + \lambda(1 - \|\theta\|_2^2) = \max_{\theta}\ \theta^T\Phi^T\Phi\theta + \lambda(1 - \theta^T\theta), \tag{3.142}$$

where $\lambda$ is the Lagrange multiplier. Checking the optimality condition of (3.142) yields that the optimal solution will satisfy

$$\Phi^T\Phi\theta - \lambda\theta = 0, \quad \theta^T\theta = 1.$$

The above equation implies that $\lambda$ is an eigenvalue of $\Phi^T\Phi$, and moreover,

$$\theta^T\Phi^T\Phi\theta = \lambda\theta^T\theta = \lambda. \tag{3.143}$$

As a result, we have

$$\max_{\|\theta\|_2 = 1} \|\Phi\theta\|_2 = \Big(\max_{\|\theta\|_2 = 1} \theta^T\Phi^T\Phi\theta\Big)^{1/2} = (\max \lambda)^{1/2} = (\lambda_{\max})^{1/2},$$

where $\lambda_{\max}$ is the largest eigenvalue of $\Phi^T\Phi$, which is equal to $\sigma_{\max}^2(\Phi)$. Thus (3.141) is indeed equal to (3.140).
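
The definitions (3.138)-(3.140) and the equality between (3.140) and (3.141) can be checked numerically. The following sketch is only an illustration with an arbitrary random matrix; it computes the three norms from the singular values and compares the spectral norm with the square root of the largest eigenvalue of $\Phi^T\Phi$.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((5, 3))        # arbitrary test matrix with N = 5, n = 3

s = np.linalg.svd(Phi, compute_uv=False)  # singular values sigma_1 >= ... >= sigma_n

nuclear = s.sum()                         # (3.138): l1 norm of the singular values
frobenius = np.sqrt((s ** 2).sum())       # (3.139): l2 norm of the singular values
spectral = s.max()                        # (3.140): l-infinity norm of the singular values

# Largest eigenvalue of Phi^T Phi, whose square root should equal the spectral norm
lam_max = np.linalg.eigvalsh(Phi.T @ Phi).max()
```

The same values are returned by `np.linalg.norm(Phi, "nuc")`, `np.linalg.norm(Phi, "fro")` and `np.linalg.norm(Phi, 2)`.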

The aforementioned three matrix norms, the nuclear norm, the Frobenius norm and the spectral norm, can be seen as natural extensions of the three vector norms: the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms, respectively. In particular, if we construct an $n$-dimensional vector with the $n$ singular values of $\Phi$ as its elements, then the three matrix norms $\|\Phi\|_*$, $\|\Phi\|_F$ and $\|\Phi\|_2$ correspond to the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms of the constructed vector, respectively. Moreover, for any given norm $\|\cdot\|$ on $\mathbb{R}^{N\times n}$, there exists a dual norm $\|\cdot\|_d$ of $\|\cdot\|$ defined as

$$\|A\|_d = \sup\{\operatorname{trace}(A^TB) \mid B \in \mathbb{R}^{N\times n},\ \|B\| \le 1\}. \tag{3.144}$$

For the vector norms, the dual norm of the $\ell_1$ norm is the $\ell_\infty$ norm and the dual norm of the $\ell_2$ norm is the $\ell_2$ norm. These properties of the vector norms extend to the matrix norms we have defined: the dual norm of the nuclear norm is the spectral norm, see, e.g., [37], and the dual norm of the Frobenius norm is itself.


3.8.1.3 Matrix Inversion Lemma, Based on [49] 


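The matrix inversion lemma is also known as the Sherman-Morrison-Woodbury formula and refers to the following identity:

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\,(C^{-1} + VA^{-1}U)^{-1}\,VA^{-1}, \tag{3.145}$$

where $A$ and $C$ are invertible square $n \times n$ and $m \times m$ matrices, $U$ is $n \times m$, $V$ is $m \times n$, and $C^{-1} + VA^{-1}U$ is assumed to be invertible.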
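
A quick numerical sanity check of (3.145) is sketched below. All matrices are arbitrary illustrative choices; taking $A$ and $C$ diagonal with positive entries and $V = U^T$ is an assumption made here only to guarantee that every inverse in the formula exists.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 2

A = np.diag(rng.uniform(1.0, 2.0, size=n))  # invertible n x n matrix
C = np.diag(rng.uniform(1.0, 2.0, size=m))  # invertible m x m matrix
U = rng.standard_normal((n, m))
V = U.T                                     # makes A + U C V symmetric positive definite

A_inv = np.linalg.inv(A)
C_inv = np.linalg.inv(C)

lhs = np.linalg.inv(A + U @ C @ V)          # direct inversion of the n x n matrix
rhs = A_inv - A_inv @ U @ np.linalg.inv(C_inv + V @ A_inv @ U) @ V @ A_inv
```

The practical interest of the right-hand side is that, once $A^{-1}$ is available, only an $m \times m$ matrix has to be inverted, which is cheap when $m$ is much smaller than $n$.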


3.8.2 Proof of Lemma 3.1 


Define $W = -(QR + I_n)^{-1}$ and $W_0 = -(ZR + I_n)^{-1}$. Then (3.67) can be rewritten as

$$W(QRQ + Z)W^T \ge W_0(ZRZ + Z)W_0^T. \tag{3.146}$$

Note that

$$I_n + W = -WQR, \quad I_n + W_0 = -W_0ZR, \tag{3.147}$$

thus (3.67) can be further rewritten as

$$(I_n + W)R^{-1}(I_n + W)^T + WZW^T \ge (I_n + W_0)R^{-1}(I_n + W_0)^T + W_0ZW_0^T. \tag{3.148}$$

In the following, we show that

$$\begin{aligned} &(I_n + W)R^{-1}(I_n + W)^T + WZW^T - (I_n + W_0)R^{-1}(I_n + W_0)^T - W_0ZW_0^T \\ &\quad = (W - W_0)(R^{-1} + Z)(W - W_0)^T. \end{aligned} \tag{3.149}$$

Simple calculation shows that (3.149) is equivalent to

$$\begin{aligned} &(I_n + W_0)R^{-1}W^T + WR^{-1}(I_n + W_0)^T - (I_n + W_0)R^{-1}W_0^T - W_0R^{-1}(I_n + W_0)^T \\ &\quad = 2W_0ZW_0^T - W_0ZW^T - WZW_0^T. \end{aligned} \tag{3.150}$$

It follows from the second equation of (3.147) that

$$(I_n + W_0)R^{-1} = -W_0Z. \tag{3.151}$$

Now inserting (3.151) into the left-hand side of (3.150) shows that (3.150), and thus (3.149), holds. Moreover, since $(W - W_0)(R^{-1} + Z)(W - W_0)^T$ in (3.149) is positive semidefinite, Eq. (3.148) holds as well, which in turn implies that (3.67) holds. This completes the proof.


3.8.3 Derivation of Predicted Residual Error Sum of Squares 
(PRESS) 


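For the case when the $k$th measured output $y_k$, $k = 1, \ldots, N$, is not used, the corresponding ReLS-Q estimate becomes

$$\hat\theta_{-k}^R = \left[\sum_{i=1, i\ne k}^{N} \phi_i\phi_i^T + \sigma^2P^{-1}\right]^{-1} \sum_{i=1, i\ne k}^{N} \phi_i y_i. \tag{3.152}$$

For the $k$th measured output $y_k$, $k = 1, \ldots, N$, the corresponding predicted output $\hat y_{-k}$ and validation error $r_{-k}$ are

$$\hat y_{-k} = \phi_k^T \left[\sum_{i=1, i\ne k}^{N} \phi_i\phi_i^T + \sigma^2P^{-1}\right]^{-1} \sum_{i=1, i\ne k}^{N} \phi_i y_i, \tag{3.153a}$$

$$r_{-k} = y_k - \hat y_{-k}. \tag{3.153b}$$

With $M$ defined in (3.81) and by the Woodbury matrix identity, e.g., [10, 17], we have

$$\left[\sum_{i=1, i\ne k}^{N} \phi_i\phi_i^T + \sigma^2P^{-1}\right]^{-1} = (M - \phi_k\phi_k^T)^{-1} = M^{-1} - \frac{M^{-1}\phi_k\phi_k^TM^{-1}}{-1 + \phi_k^TM^{-1}\phi_k}. \tag{3.154}$$

Then, with $r_k = y_k - \phi_k^TM^{-1}\sum_{i=1}^{N}\phi_i y_i$ denoting the $k$th residual of the estimate computed from all the data, we have

$$\begin{aligned}
r_{-k} &= y_k - \phi_k^TM^{-1}\sum_{i=1, i\ne k}^{N}\phi_i y_i + \phi_k^T\frac{M^{-1}\phi_k\phi_k^TM^{-1}}{-1 + \phi_k^TM^{-1}\phi_k}\sum_{i=1, i\ne k}^{N}\phi_i y_i \\
&= r_k + \phi_k^TM^{-1}\phi_k y_k + \phi_k^T\frac{M^{-1}\phi_k\phi_k^TM^{-1}}{-1 + \phi_k^TM^{-1}\phi_k}\sum_{i=1, i\ne k}^{N}\phi_i y_i \\
&= r_k + \phi_k^TM^{-1}\phi_k\left[y_k + \frac{\phi_k^TM^{-1}}{-1 + \phi_k^TM^{-1}\phi_k}\sum_{i=1, i\ne k}^{N}\phi_i y_i\right] \\
&= r_k + \frac{\phi_k^TM^{-1}\phi_k}{-1 + \phi_k^TM^{-1}\phi_k}\left[-y_k + \phi_k^TM^{-1}\phi_k y_k + \phi_k^TM^{-1}\sum_{i=1, i\ne k}^{N}\phi_i y_i\right] \\
&= r_k - \frac{\phi_k^TM^{-1}\phi_k}{-1 + \phi_k^TM^{-1}\phi_k}\, r_k \\
&= \frac{r_k}{1 - \phi_k^TM^{-1}\phi_k},
\end{aligned} \tag{3.155}$$

which shows that $r_{-k}$ is actually obtained by scaling $r_k$ by the factor $1/(1 - \phi_k^TM^{-1}\phi_k)$. Accordingly, the sum of squares of the validation errors is

$$\sum_{k=1}^{N} r_{-k}^2 = \sum_{k=1}^{N} \frac{r_k^2}{(1 - \phi_k^TM^{-1}\phi_k)^2}. \tag{3.156}$$

Then the PRESS (3.80) is obtained by minimizing (3.156) with respect to $\eta \in \Gamma$.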
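
The practical value of (3.155) is that all $N$ leave-one-out errors follow from a single fit. The sketch below compares the shortcut with brute-force refitting on synthetic data. It assumes $M = \sum_{i=1}^{N}\phi_i\phi_i^T + \sigma^2P^{-1}$, as implied by the first equality in (3.154); the data, the value of $\sigma^2$ and the choice $P = I_n$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 12, 4
Phi = rng.standard_normal((N, n))  # row k is phi_k^T
Y = rng.standard_normal(N)         # synthetic outputs
sigma2 = 0.5                       # assumed noise variance
P = np.eye(n)                      # hypothetical regularization matrix
P_inv = np.linalg.inv(P)

# Single fit on all data
M = Phi.T @ Phi + sigma2 * P_inv
M_inv = np.linalg.inv(M)
r = Y - Phi @ (M_inv @ Phi.T @ Y)               # residuals r_k
h = np.einsum("ki,ij,kj->k", Phi, M_inv, Phi)   # phi_k^T M^{-1} phi_k
r_loo_fast = r / (1.0 - h)                      # shortcut (3.155)

# Brute force: refit N times, each time leaving sample k out
r_loo_brute = np.empty(N)
for k in range(N):
    keep = np.arange(N) != k
    M_k = Phi[keep].T @ Phi[keep] + sigma2 * P_inv
    theta_k = np.linalg.solve(M_k, Phi[keep].T @ Y[keep])
    r_loo_brute[k] = Y[k] - Phi[k] @ theta_k

press = np.sum(r_loo_fast ** 2)                 # criterion (3.156) for this P and sigma2
```

In a hyperparameter search, `press` would be evaluated for each candidate $\eta$ (each giving a different $P$) and the minimizer retained.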
3.8.4 Proof of Theorem 3.7


Using (3.92) and (3.100), it is easy to see that proving (3.90) is equivalent to showing that

$$\mathscr{E}\Big[\underbrace{\tfrac{1}{N}\|Y - \Phi\hat\theta^R(\eta)\|_2^2}_{\mathrm{err}(\eta)}\Big] \le \underbrace{\mathscr{E}\Big[\tfrac{1}{N}\|Y_v - \Phi\hat\theta^R(\eta)\|_2^2\Big]}_{\mathrm{EVE}_{\mathrm{in}}(\eta)} \tag{3.157}$$

and to prove the above inequality we need the following lemma.


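Lemma 3.3 Consider the following additive measurement model:

$$x = \mu + \varepsilon, \quad x, \mu, \varepsilon \in \mathbb{R}^p, \tag{3.158}$$

where $\mu$ is an unknown constant vector and $\varepsilon$ is a random variable with zero mean and covariance matrix $\mathscr{E}(\varepsilon\varepsilon^T) = \Sigma$. Let $\hat\mu(x)$ be an estimator of $\mu$ based on the data $x$ and let $\tilde x$ be new data generated from

$$\tilde x = \mu + \tilde\varepsilon, \quad \tilde x \in \mathbb{R}^p, \tag{3.159}$$

where $\tilde\varepsilon$ is a random variable uncorrelated with $\varepsilon$ and has zero mean and covariance matrix $\mathscr{E}(\tilde\varepsilon\tilde\varepsilon^T) = \Sigma$. Then it holds that

$$\begin{aligned} \mathscr{E}(\|\tilde x - \hat\mu(x)\|_2^2) &= \mathscr{E}(\|\mu - \hat\mu(x)\|_2^2) + \operatorname{trace}(\Sigma) && (3.160) \\ &= \mathscr{E}(\|x - \hat\mu(x)\|_2^2) + 2\operatorname{trace}(\operatorname{Cov}(\hat\mu(x), x)), && (3.161) \end{aligned}$$

where the expectation is over both $\varepsilon$ and $\tilde\varepsilon$.


Proof Firstly, we consider (3.160). We have

$$\begin{aligned} \mathscr{E}(\|\tilde x - \hat\mu(x)\|_2^2) &= \mathscr{E}(\|\tilde x - \mu + \mu - \hat\mu(x)\|_2^2) \\ &= \mathscr{E}(\|\mu - \hat\mu(x)\|_2^2) + \mathscr{E}(\|\tilde x - \mu\|_2^2) + 2\mathscr{E}[(\tilde x - \mu)^T(\mu - \hat\mu(x))] \\ &= \mathscr{E}(\|\mu - \hat\mu(x)\|_2^2) + \mathscr{E}(\|\tilde\varepsilon\|_2^2), \end{aligned}$$

which shows that (3.160) is true, since $\mathscr{E}(\|\tilde\varepsilon\|_2^2) = \operatorname{trace}(\Sigma)$.
Secondly, we consider (3.161). Similarly, we have

$$\begin{aligned} \mathscr{E}(\|\tilde x - \hat\mu(x)\|_2^2) &= \mathscr{E}(\|\tilde x - x + x - \hat\mu(x)\|_2^2) \\ &= \mathscr{E}(\|x - \hat\mu(x)\|_2^2) + \mathscr{E}(\|\tilde x - x\|_2^2) + 2\mathscr{E}[(\tilde x - x)^T(x - \hat\mu(x))] \\ &= \mathscr{E}(\|x - \hat\mu(x)\|_2^2) + \mathscr{E}(\|\tilde\varepsilon - \varepsilon\|_2^2) + 2\mathscr{E}[(\tilde\varepsilon - \varepsilon)^T(\varepsilon + \mu - \hat\mu(x))] \\ &= \mathscr{E}(\|x - \hat\mu(x)\|_2^2) + 2\operatorname{trace}(\Sigma) - 2\mathscr{E}[\varepsilon^T(\varepsilon + \mu - \hat\mu(x))] \\ &= \mathscr{E}(\|x - \hat\mu(x)\|_2^2) + 2\mathscr{E}[\varepsilon^T\hat\mu(x)], \end{aligned}$$

which implies that (3.161) is true, since $\mathscr{E}[\varepsilon^T\hat\mu(x)] = \operatorname{trace}(\operatorname{Cov}(\hat\mu(x), x))$.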


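Now we prove (3.157) by applying Lemma 3.3. Let

$$x = Y,\ \mu = \Phi\theta_0,\ \hat\mu(x) = \Phi\hat\theta^R,\ \tilde x = Y_v,\ \varepsilon = E,\ \tilde\varepsilon = E_v,\ \Sigma = \sigma^2I_N, \tag{3.162}$$

and then it follows from (3.161) that

$$\underbrace{\mathscr{E}\Big[\tfrac{1}{N}\|Y_v - \Phi\hat\theta^R(\eta)\|_2^2\Big]}_{\mathrm{EVE}_{\mathrm{in}}(\eta)} - \mathscr{E}\Big[\underbrace{\tfrac{1}{N}\|Y - \Phi\hat\theta^R(\eta)\|_2^2}_{\mathrm{err}(\eta)}\Big] = \frac{2}{N}\operatorname{trace}(\operatorname{Cov}(Y, \Phi\hat\theta^R(\eta))). \tag{3.163}$$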


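Next we show that the right-hand side of (3.163) is nonnegative. For the ReLS-Q problem (3.58a) with the ReLS-Q estimate (3.58b), the predicted output $\hat Y(\eta)$ of $Y$ is

$$\hat Y(\eta) = \Phi\hat\theta^R(\eta) = \Phi P\Phi^T(\Phi P\Phi^T + \sigma^2I_N)^{-1}Y.$$

Then we have

$$\begin{aligned} \operatorname{Cov}(Y, \Phi\hat\theta^R(\eta)) &= \operatorname{Cov}(Y, \hat Y(\eta)) \\ &= \mathscr{E}\big[(Y - \mathscr{E}(Y))(\hat Y(\eta) - \mathscr{E}(\hat Y(\eta)))^T\big] \\ &= \mathscr{E}\big[(Y - \mathscr{E}(Y))(Y - \mathscr{E}(Y))^T\big]\,\Phi P\Phi^T(\Phi P\Phi^T + \sigma^2I_N)^{-1} \\ &= \sigma^2\,\Phi P\Phi^T(\Phi P\Phi^T + \sigma^2I_N)^{-1} = \sigma^2H, \end{aligned} \tag{3.164}$$

where $H$ is the hat matrix defined in (3.63), which is symmetric. One has

$$\operatorname{trace}(\operatorname{Cov}(Y, \Phi\hat\theta^R(\eta))) = \sigma^2\operatorname{trace}(H) \ge 0.$$

Therefore, the right-hand side of (3.163) is nonnegative and thus (3.90) holds true, completing the proof of Theorem 3.7.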


3.8.5 A Variant of the Expected In-Sample Validation Error 
and Its Unbiased Estimator 


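It is possible to derive variants of the expected in-sample validation error and its unbiased estimator by modifying (3.92) and (3.100).

Assume that $\Phi$ is full rank, i.e., $\operatorname{rank}(\Phi) = n$. Then, multiplying both sides of (3.92) and (3.100) with $(\Phi^T\Phi)^{-1}\Phi^T$ yields

$$(\Phi^T\Phi)^{-1}\Phi^TY = \theta_0 + (\Phi^T\Phi)^{-1}\Phi^TE, \tag{3.165}$$

$$(\Phi^T\Phi)^{-1}\Phi^TY_v = \theta_0 + (\Phi^T\Phi)^{-1}\Phi^TE_v, \tag{3.166}$$

which will be our new “true system” and new “validation data”, respectively. Different from (3.162), we now take

$$\begin{aligned} &x = (\Phi^T\Phi)^{-1}\Phi^TY,\ \mu = \theta_0,\ \hat\mu(x) = \hat\theta^R(\eta),\ \tilde x = (\Phi^T\Phi)^{-1}\Phi^TY_v, \\ &\varepsilon = (\Phi^T\Phi)^{-1}\Phi^TE,\ \tilde\varepsilon = (\Phi^T\Phi)^{-1}\Phi^TE_v,\ \Sigma = \sigma^2(\Phi^T\Phi)^{-1}. \end{aligned} \tag{3.167}$$

Note that $\hat\theta^{LS} = (\Phi^T\Phi)^{-1}\Phi^TY$ and then it follows from (3.160) and (3.161) that

$$\begin{aligned} \mathscr{E}(\|(\Phi^T\Phi)^{-1}\Phi^TY_v - \hat\theta^R(\eta)\|_2^2) &= \mathscr{E}(\|\hat\theta^R(\eta) - \theta_0\|_2^2) + \sigma^2\operatorname{trace}((\Phi^T\Phi)^{-1}) \\ &= \mathscr{E}(\|\hat\theta^{LS} - \hat\theta^R(\eta)\|_2^2) + 2\operatorname{trace}(\operatorname{Cov}(\hat\theta^R(\eta), \hat\theta^{LS})). \end{aligned}$$

From the above two equations, we have

$$\mathscr{E}(\|\hat\theta^R(\eta) - \theta_0\|_2^2) = \mathscr{E}(\|\hat\theta^{LS} - \hat\theta^R(\eta)\|_2^2) + 2\operatorname{trace}(\operatorname{Cov}(\hat\theta^R(\eta), \hat\theta^{LS})) - \sigma^2\operatorname{trace}((\Phi^T\Phi)^{-1}).$$

Further note that

$$\hat\theta^R(\eta) = (\Phi^T\Phi + \sigma^2P^{-1}(\eta))^{-1}\Phi^TY = (\Phi^T\Phi + \sigma^2P^{-1}(\eta))^{-1}\Phi^T\Phi\,\hat\theta^{LS}, \qquad \operatorname{Cov}(\hat\theta^{LS}, \hat\theta^{LS}) = \sigma^2(\Phi^T\Phi)^{-1},$$

then we have

$$\mathscr{E}(\|\hat\theta^R(\eta) - \theta_0\|_2^2) = \mathscr{E}(\|\hat\theta^{LS} - \hat\theta^R(\eta)\|_2^2) + 2\sigma^2\operatorname{trace}\big((\Phi^T\Phi + \sigma^2P^{-1}(\eta))^{-1} - 0.5(\Phi^T\Phi)^{-1}\big). \tag{3.168}$$

Note that $\mathscr{E}(\|\hat\theta^R(\eta) - \theta_0\|_2^2)$ is equal to $\operatorname{trace}(\mathrm{MSE}(\hat\theta^R(\eta), \theta_0))$; we denote it by $\mathrm{mse}_\eta$, and we readily obtain an unbiased estimator of $\mathrm{mse}_\eta$ as follows:

$$\widehat{\mathrm{mse}}_\eta = \|\hat\theta^{LS} - \hat\theta^R(\eta)\|_2^2 + 2\sigma^2\operatorname{trace}\big((\Phi^T\Phi + \sigma^2P^{-1}(\eta))^{-1} - 0.5(\Phi^T\Phi)^{-1}\big). \tag{3.169}$$

Now, given the training data (3.84), the corresponding estimate $\widehat{\mathrm{mse}}_\eta$ of $\mathrm{mse}_\eta$ can be used to estimate the hyperparameter $\eta$: we should take the value of $\eta \in \Gamma$ that minimizes (3.169), i.e.,

$$\hat\eta = \arg\min_{\eta \in \Gamma}\ \|\hat\theta^{LS} - \hat\theta^R(\eta)\|_2^2 + 2\sigma^2\operatorname{trace}\big((\Phi^T\Phi + \sigma^2P^{-1}(\eta))^{-1} - 0.5(\Phi^T\Phi)^{-1}\big). \tag{3.170}$$

The criterion (3.170) is known as the SURE of the expected in-sample validation error for the true system (3.165) and the validation data (3.166), e.g., [33, 40].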
Chapter 4
Bayesian Interpretation of Regularization


Abstract In the previous chapter, it has been shown that the regularization approach 
is particularly useful when information contained in the data is not sufficient to 
obtain a precise estimate of the unknown parameter vector and standard methods, 
such as least squares, yield poor solutions. The fact itself that an estimate is regarded 
as poor suggests the existence of some form of prior knowledge on the degree of 
acceptability of candidate solutions. It is this knowledge that guides the choice of 
the regularization penalty that is added as a corrective term to the usual sum of 
squared residuals. In the previous chapters, this design process has been described 
in a deterministic setting where only the measurement noises are random. In this 
chapter, we will see that an alternative formalization of prior information is obtained 
if a subjective/Bayesian estimation paradigm is adopted. The major difference is 
that the parameters, rather than being regarded as deterministic, are now treated 
as a random vector. This stochastic setting permits the definition of new powerful 
tools both for prior selection, e.g., through the maximum entropy principle, and for
regularization parameter tuning, e.g., through the empirical Bayes approach and its
connection with the concept of equivalent degrees of freedom. 


4.1 Preliminaries 


We have seen that the regularization approach can be used to effectively solve esti- 
mation problems that are otherwise ill-conditioned. In particular, a penalty is added 
as a corrective term to the usual sum of squared residuals. The regularizer is chosen so as to penalize candidate solutions that depart from our prior knowledge of some features of the unknown parameter vector. In this way, between two candidate solutions achieving the same squared loss, the one more consistent with that knowledge is preferred.

It is worth noting that the regularization approach lies within a frequentist 
paradigm in which the observed data, affected by noise, are random variables, but 
the unknown parameter vector is deterministic in nature. For linear-in-parameter 
models, regularization yields an estimate that, though biased, may be preferable to
the unbiased least squares estimate in view of the smaller variance. In particular, 
the tuning of the regularization parameter aims at an advantageous solution of the 
bias-variance dilemma. By trading an excessive variance for some bias, a smaller 
mean squared error may be achieved, as exemplified by the James—Stein estimator. 
An alternative formalization of prior information is obtained if a subjective/Bayesian 
estimation paradigm is adopted. The major difference is that the parameters, rather 
than being regarded as deterministic, are now treated as a random vector. 

In order to introduce the Bayesian paradigm, it can be useful to start with a simple 
example in which the parameters do depend on the result of a random experiment. 
Consider a metabolism model for which the parameter vector $\theta$ can take only two possible values, $\theta_h$ and $\theta_d$, associated with healthy and diabetic patients, respectively. The model specifies $p(Y|\theta)$, where $Y$ are observations collected from a randomly chosen patient with 90% probability of being healthy and 10% probability of being diabetic. In this simple case, model identification amounts to deciding between $\theta_h$ and $\theta_d$. It is also clear that $\theta$ is a discrete random variable with $p(\theta = \theta_h) = 0.9$ and $p(\theta = \theta_d) = 0.1$. These probabilities summarize the prior information about the unknown parameter, before any observation is collected. Once the data $Y$ become available, the Bayes formula can be used to compute the posterior probability

$$p(\theta_h|Y) = \frac{p(Y|\theta_h)p(\theta_h)}{p(Y)} = \frac{p(Y|\theta_h)p(\theta_h)}{p(Y|\theta_h)p(\theta_h) + p(Y|\theta_d)p(\theta_d)}. \tag{4.1}$$

Of course, $p(\theta_d|Y) = 1 - p(\theta_h|Y)$. In particular, if the data $Y$ are consistent with diabetes symptoms, it may well happen that $p(\theta_d|Y) > 0.5$, in which case $\hat\theta = \theta_d$ would be taken as the final estimate.
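
A small numerical illustration of (4.1) follows. The prior probabilities 0.9 and 0.1 come from the example above; the two likelihood values are hypothetical numbers chosen only to represent data that are far more plausible under the diabetic model than under the healthy one.

```python
def posterior_healthy(lik_h, lik_d, prior_h=0.9):
    """Posterior probability p(theta_h | Y) from Bayes formula (4.1).

    lik_h and lik_d stand for p(Y | theta_h) and p(Y | theta_d).
    """
    prior_d = 1.0 - prior_h
    evidence = lik_h * prior_h + lik_d * prior_d  # p(Y)
    return lik_h * prior_h / evidence


# Hypothetical likelihoods: the data are 20 times more likely under theta_d
p_h = posterior_healthy(lik_h=0.02, lik_d=0.40)
p_d = 1.0 - p_h
estimate = "theta_d" if p_d > 0.5 else "theta_h"
```

Here the strong evidence in the data outweighs the 9-to-1 prior in favour of the healthy model. With equal likelihoods the posterior would simply coincide with the prior.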

In the previous example, the prior probability distribution assigned to $\theta$ reflects a real experiment, that is, the random choice of a patient from a population where 90% of subjects are healthy, which implies a prejudice in favour of $\theta = \theta_h$. In other words, the prior distribution ranks the candidate parameters according to the available a priori knowledge. If we look at the numerator of (4.1), we see that it combines a priori information with the data through the product of the prior probability $p(\theta_h)$ and the likelihood $p(Y|\theta_h)$. In the example, the population was a binary one (either healthy or diabetic), but we can imagine more complex populations allowing for several countable or even uncountable possible values of $\theta$.

In the actual Bayesian paradigm a further step is made: the parameters $\theta$ are assigned a prior probability $p(\theta)$, even if there does not exist an underlying experiment that draws the model from a population of possible models. According to the subjective definition of probability, $p(\theta = \bar\theta)$ represents the (subjective) degree of belief that $\theta$ is going to take the value $\bar\theta$. In particular, in analogy with the regularization penalty, it is possible to rank the possible values of $\theta$, assigning a low probability to values whose occurrence is deemed unlikely. In our context, the intrinsically subjective nature of the prior probability, a controversial issue in the confrontation between the frequentist and Bayesian paradigms, mirrors the subjective choice of the regularization penalty: rather than expressing the preference for some solutions through the choice of a proper penalty, the preference is formulated by means of a prior distribution.

As shown in the following, many formulas and results can be indifferently derived 
adopting either the regularization or the Bayesian paradigm. However, the Bayesian 
approach has its advantages. In particular, the tuning of the regularization parameter, rather than being addressed on an ad hoc basis, can be formulated as a statistical estimation problem. Moreover, the Bayesian paradigm offers a very natural way to assess uncertainty intervals, whereas the regularization paradigm has a harder time assessing the amount of bias in the estimate. Among the disadvantages, one may mention the need for a deeper probabilistic background in order to gain a full comprehension of all aspects.

Throughout the chapter we will mainly focus on the linear Gaussian case, but the approach is more general and some hints at generalizations will be provided. In addition, we will use $\theta$ to denote the stochastic vector that has generated the data, in contrast with the deterministic $\theta_0$ used in the classical setting discussed in the previous chapter.


4.2 Incorporating Prior Knowledge via Bayesian 
Estimation 


We consider the problem of estimating a parameter vector $\theta \in \mathbb{R}^n$, based on the observation vector $Y \in \mathbb{R}^N$. The two ingredients of Bayesian estimation are the prior distribution of $\theta$, also known in short as the prior, and the conditional distribution of $Y$ given $\theta$. As already observed, the basic assumption is that the parameter vector $\theta$ is not completely unknown, but rather some prior knowledge is available that is formulated in terms of subjective probability, specified as a probability density function:

$$p(\theta) : \mathbb{R}^n \to \mathbb{R}.$$


The density function $p(\theta)$ is chosen by the user so as to assign a low probability to values whose occurrence is deemed unlikely. For instance, if $\theta$ is a scalar parameter whose value is believed to lie more or less around 30, hardly smaller than 20 and hardly larger than 40, this prior knowledge can be embedded in a Gaussian density with $\mathscr{E}\theta = \mu_\theta = 30$ and standard deviation $\sigma_\theta = 5$:

$$\theta \sim \mathcal{N}(30, 25).$$

In fact, under this distribution, $p(|\theta - \mu_\theta| > 2\sigma_\theta) = p(|\theta - 30| > 10) < 0.05$. Although not impossible, it is considered unlikely that values of $\theta$ too distant from 30 are going to occur. A natural question is how and when our prior knowledge is sufficient to specify a distribution. This crucial issue calls for the notion and role of hyperparameters, see Sect. 4.2.4, and for the possible use of the maximum entropy principle as a way to obtain an entire probability distribution from partial knowledge relative to its moments, see Sect. 4.6.
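
The tail probability quoted for the Gaussian prior $\mathcal{N}(30, 25)$ can be computed with the standard library alone, using the identity $P(|Z| > k) = \operatorname{erfc}(k/\sqrt{2})$ for a standard Gaussian $Z$. The function name below is ours; this is a sketch for checking the number, not part of the original text.

```python
import math


def two_sided_tail(k):
    """P(|theta - mu| > k * sigma) when theta is Gaussian with mean mu and std sigma."""
    return math.erfc(k / math.sqrt(2.0))


mu, sigma = 30.0, 5.0         # prior mean and standard deviation from the example
k = (40.0 - mu) / sigma       # the interval [20, 40] corresponds to k = 2
tail = two_sided_tail(k)      # probability that theta falls outside [20, 40]
```

The result is about 0.0455, consistent with the bound $< 0.05$ stated above.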

The second ingredient is the conditional distribution of $Y$ given $\theta$ that, when considered as a function of $\theta$, is also known as the likelihood:

$$L(\theta|Y) = p(Y|\theta) = \frac{p(Y, \theta)}{p(\theta)},$$

where $p(Y, \theta)$ is the joint probability distribution of the random vectors $Y$ and $\theta$. The
likelihood is usually obtained from some mathematical model of the data. Consider, 
for instance, the simple model 


Y; =0Vi+e, i=1,...,N, 


where $e_i \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed measurement
errors, with known variance $\sigma^2$. Conditional on $\theta$, i.e., assuming that $\theta$ is known, $Y_i$
is Gaussian with


E[¥i|0]=0Vi, Var (Y;|0) = o° 


so that, in view of independence, the likelihood is 
N 
LOY) = pY 10) =] ple). p16) = NOVI, 0). 
i=l 


When both the prior distribution $p(\theta)$ and the likelihood $p(Y|\theta)$ have been specified, the Bayes formula yields the posterior distribution

$$p(\theta|Y) = \frac{p(Y|\theta)p(\theta)}{p(Y)}.$$

We have seen that all our prior knowledge was embedded in the prior. In a similar way, all the knowledge obtained by the combination of prior information with the new information brought by the observations is now embedded in the posterior distribution $p(\theta|Y)$, referred to in short as the posterior.

Although all the relevant information is encapsulated within the posterior, a point estimate is often required for practical or communication purposes. The Maximum A Posteriori (MAP) estimate is the value that maximizes the posterior:

$$\theta^{\mathrm{MAP}} = \arg\max_{\theta}\ p(\theta|Y). \tag{4.2}$$

Its interpretation is simple, as it represents the most likely value, once the prior knowledge has been updated taking into account the observations. Alternatively, the mean squared error

$$\mathrm{MSE}(\hat\theta) = \mathscr{E}\big(\|\theta - \hat\theta\|^2 \,\big|\, Y\big)$$

can be used as a criterion to select the point estimate $\hat\theta$. Above, $\mathscr{E}(\cdot|Y)$ denotes the expected value taken with respect to the posterior distribution $p(\theta|Y)$. The following classical result from estimation theory (whose proof is in Sect. 4.13.1) then holds.


Theorem 4.1 The minimizer of the MSE

$$\theta^{\mathrm{B}} = \arg\min_{\hat\theta}\; \mathrm{MSE}(\hat\theta)$$

is known as the Bayes estimate and can be shown to be equal to the conditional mean:

$$\theta^{\mathrm{B}} = \mathcal{E}[\theta|Y].$$


A third point estimate is the conditional median used especially in view of its 
statistical robustness when the posterior is obtained numerically via stochastic sim- 
ulation algorithms, see Sect. 4.10. 

When, in addition to a point estimate, an assessment of the uncertainty is needed,
it can be derived from the posterior through the computation of a properly defined
credible region $C_\gamma \subseteq \mathbb{R}^n$ such that

$$\Pr(\theta \in C_\gamma | Y) = \gamma. \tag{4.3}$$

For example, $C_\gamma$ could be taken as the smallest region such that (4.3) holds, a choice
that goes under the name of highest posterior density region.


4.2.1 Multivariate Gaussian Variables 


In this subsection, some basic properties and definitions of multivariate Gaussian 
variables are recalled. This review is instrumental to the derivation of the Bayesian 
estimator when observations and parameters are jointly Gaussian, see Sect. 4.2.2. In 
turn, this will pave the way to the analysis of the linear model under additive Gaussian 
measurement errors, see Sect. 4.2.3. 

A random vector $Z = [Z_1 \ldots Z_m]^T$ is said to be distributed according to a
nondegenerate $m$-variate Gaussian distribution if its joint probability density function is
of the type

$$p(z_1, \ldots, z_m) = \frac{1}{\sqrt{(2\pi)^m \det(V)}} \exp\left(-\frac{1}{2}(z-\mu)^T V^{-1}(z-\mu)\right), \tag{4.4}$$

where $V$ is a symmetric positive definite matrix and $\mu$ is some vector in $\mathbb{R}^m$.
It can be shown that

$$\mathcal{E}(Z) = \mu, \quad \mathrm{Var}(Z) = V.$$

Then, the notation

$$Z \sim \mathcal{N}(\mu, V)$$

(already used before in the scalar case) indicates that $Z$ is a multivariate Gaussian
(Normal) random vector with mean $\mu$ and variance matrix $V$.


Property 4.1 If $Z \sim \mathcal{N}(\mu, V)$ and $Y = AZ$, where $A \in \mathbb{R}^{n \times m}$, $n \le m$, is a
full-rank deterministic matrix, then

$$Y \sim \mathcal{N}(A\mu, AVA^T).$$

In particular, it follows that the marginal distributions of the entries of $Z$ are
Gaussian:

$$Z_i \sim \mathcal{N}(\mu_i, V_{ii}).$$


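Property 4.2 Assuming $Z \sim \mathcal{N}(\mu, V)$, let $X = [Z_1 \ldots Z_n]^T$, $Y = [Z_{n+1} \ldots Z_m]^T$,
where $1 \le n < m$, and partition $\mu$ and $V$ accordingly:

$$\mu = \begin{bmatrix}\mu_X \\ \mu_Y\end{bmatrix}, \quad V = \begin{bmatrix}V_{XX} & V_{XY} \\ V_{YX} & V_{YY}\end{bmatrix}.$$

Then, $p(X|Y = y)$ is a multivariate Gaussian density function with

$$\mathcal{E}(X|Y = y) = \mu_X + V_{XY}V_{YY}^{-1}(y - \mu_Y)$$
$$\mathrm{Var}(X|Y = y) = V_{XX} - V_{XY}V_{YY}^{-1}V_{YX}$$

and we can write

$$(X|Y = y) \sim \mathcal{N}\left(\mu_X + V_{XY}V_{YY}^{-1}(y - \mu_Y),\; V_{XX} - V_{XY}V_{YY}^{-1}V_{YX}\right),$$

where $X|Y = y$ stands for the random vector $X$ conditional on $Y = y$.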
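
The conditioning formulas of Property 4.2 translate directly into a few lines of linear algebra. The following is a minimal sketch in Python with NumPy; the function name `gaussian_conditional` and its argument layout are illustrative assumptions, not notation from the text.

```python
import numpy as np

def gaussian_conditional(mu, V, n, y):
    """Mean and variance of X | Y = y, where X collects the first n entries
    of Z ~ N(mu, V) and Y the remaining ones (Property 4.2)."""
    mu = np.asarray(mu, dtype=float)
    V = np.asarray(V, dtype=float)
    mu_x, mu_y = mu[:n], mu[n:]
    V_xx = V[:n, :n]
    V_yx, V_yy = V[n:, :n], V[n:, n:]
    # gain = V_xy V_yy^{-1}; obtained by solving a linear system (V is symmetric).
    gain = np.linalg.solve(V_yy, V_yx).T
    cond_mean = mu_x + gain @ (np.asarray(y, dtype=float) - mu_y)
    cond_var = V_xx - gain @ V_yx
    return cond_mean, cond_var

# Bivariate example: observing Y = 2 shifts the mean of X and reduces its variance.
mean_x, var_x = gaussian_conditional([0.0, 0.0], [[1.0, 0.5], [0.5, 2.0]], 1, [2.0])
```

In this example the conditional mean is 0.5 * 2 / 2 = 0.5 and the conditional variance is 1 - 0.25 / 2 = 0.875, smaller than the prior variance of 1.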


4.2.2 The Gaussian Case 


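Let us consider the case in which the observation vector $Y \in \mathbb{R}^N$ and the unknown
vector $\theta \in \mathbb{R}^n$ are jointly Gaussian:

$$\begin{bmatrix}\theta \\ Y\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\mu_\theta \\ \mu_Y\end{bmatrix}, \begin{bmatrix}\Sigma_\theta & \Sigma_{\theta Y} \\ \Sigma_{Y\theta} & \Sigma_Y\end{bmatrix}\right), \quad \Sigma_Y > 0. \tag{4.5}$$

The key idea behind Bayesian estimation is referring to the posterior distribution of
$\theta$ given $Y$ as representative of the state of knowledge about the unknown vector. It
follows from Property 4.2 that such posterior is Gaussian as well:

$$\theta|Y \sim \mathcal{N}\left(\mu_\theta + \Sigma_{\theta Y}\Sigma_Y^{-1}(Y - \mu_Y),\; \Sigma_\theta - \Sigma_{\theta Y}\Sigma_Y^{-1}\Sigma_{Y\theta}\right). \tag{4.6}$$

In view of Gaussianity, $\theta^{\mathrm{MAP}}$ coincides with the conditional expectation $\mathcal{E}(\theta|Y)$:

$$\theta^{\mathrm{B}} = \theta^{\mathrm{MAP}} = \mathcal{E}(\theta|Y) = \mu_\theta + \Sigma_{\theta Y}\Sigma_Y^{-1}(Y - \mu_Y). \tag{4.7}$$

The reliability of the estimate can be assessed by the posterior variance

$$\Sigma_{\theta|Y} = \mathrm{Var}(\theta|Y) = \Sigma_\theta - \Sigma_{\theta Y}\Sigma_Y^{-1}\Sigma_{Y\theta}$$

based on which the so-called credible intervals can be derived as explained below.
The posterior variance of $\theta_i$ is the $i$-th diagonal entry of the posterior covariance
matrix:

$$\sigma^2_{\theta_i|Y} = \left[\Sigma_{\theta|Y}\right]_{ii}.$$

Observing that $\theta_i|Y \sim \mathcal{N}(\theta_i^{\mathrm{B}}, \sigma^2_{\theta_i|Y})$, it follows that

$$\Pr\left(\theta_i^{\mathrm{B}} - 1.96\,\sigma_{\theta_i|Y} \le \theta_i \le \theta_i^{\mathrm{B}} + 1.96\,\sigma_{\theta_i|Y} \,\middle|\, Y\right) = 0.95 \tag{4.8}$$

so that $[\theta_i^{\mathrm{B}} - 1.96\,\sigma_{\theta_i|Y},\; \theta_i^{\mathrm{B}} + 1.96\,\sigma_{\theta_i|Y}]$ is the 95%-credible interval for the parameter
$\theta_i$, given the observation vector $Y$. If two or more parameters are jointly considered,
the notion of credible region can be obtained in a similar way. In the Gaussian case,
such regions are suitable (hyper)-ellipsoids centred at $\theta^{\mathrm{B}}$.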


4.2.3. The Linear Gaussian Model 


The Bayesian approach can be applied to the estimation of the standard linear model
in matrix form

$$Y = \Phi\theta + E, \quad E \sim \mathcal{N}(0, \Sigma_E), \quad \Sigma_E > 0 \tag{4.9}$$

in which $Y \in \mathbb{R}^N$ and the parameter vector $\theta$ is no longer regarded as a deterministic
quantity, but as a random vector independent of $E$. In particular, we assume that some
prior information is available which is embedded in a Gaussian prior distribution

$$\theta \sim \mathcal{N}(\mu_\theta, \Sigma_\theta), \quad \Sigma_\theta > 0.$$

Since $Y$ is a linear combination of the jointly Gaussian vectors $\theta$ and $E$, the
vectors $Y$ and $\theta$ are jointly Gaussian as well. Hereafter, positive definiteness of $\Sigma_\theta$
is assumed if not stated otherwise. The singular case, see Remark 4.1, amounts to
assuming perfect knowledge of some linear combination of the unknown parameters
or, equivalently, to constraining the estimated vector $\theta$ to belong to a prescribed
subspace. The ability to incorporate this type of constraint is not unique to the Bayesian
approach. In the context of deterministic regularization, an example is given by
the optimal regularization matrix $P = \theta_0\theta_0^T$, derived in Sect. 3.4.2.1.

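In order to obtain the Bayes estimate according to (4.7), we need to compute
$\mu_Y = \mathcal{E}(Y)$, $\Sigma_{\theta Y} = \mathrm{Cov}(\theta, Y)$, and $\Sigma_Y = \mathrm{Var}(Y)$:

$$\mu_Y = \mathcal{E}(Y) = \Phi\mu_\theta$$
$$\mathrm{Var}(Y) = \mathrm{Var}(\Phi\theta) + \mathrm{Var}(E) = \Phi\Sigma_\theta\Phi^T + \Sigma_E$$
$$\mathrm{Cov}(\theta, Y) = \mathrm{Cov}(\theta, \Phi\theta) + \mathrm{Cov}(\theta, E) = \Sigma_\theta\Phi^T.$$

Then, we can apply (4.7) to obtain

$$\theta^{\mathrm{B}} = \mu_\theta + \Sigma_\theta\Phi^T\left(\Phi\Sigma_\theta\Phi^T + \Sigma_E\right)^{-1}(Y - \Phi\mu_\theta) \tag{4.10}$$
$$\mathrm{Var}(\theta|Y) = \Sigma_\theta - \Sigma_\theta\Phi^T\left(\Phi\Sigma_\theta\Phi^T + \Sigma_E\right)^{-1}\Phi\Sigma_\theta. \tag{4.11}$$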


The proofs of the following two classical results are reported in Sects. 4.13.2 and
4.13.3.


Theorem 4.2 (Orthogonality property)

$$\mathcal{E}\left[(\theta^{\mathrm{B}} - \theta)Y^T\right] = 0. \tag{4.12}$$


The following lemma, whose proof is in Sect. 4.13.3, is useful in order to obtain
an alternative expression that proves more convenient, especially when $n < N$.


Lemma 4.1 It holds that

$$\Sigma_\theta\Phi^T\left(\Phi\Sigma_\theta\Phi^T + \Sigma_E\right)^{-1} = \left(\Phi^T\Sigma_E^{-1}\Phi + \Sigma_\theta^{-1}\right)^{-1}\Phi^T\Sigma_E^{-1}.$$


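By applying the previous lemma, the alternative expression of the Bayes estimate is
obtained:

$$\theta^{\mathrm{B}} = \left(\Phi^T\Sigma_E^{-1}\Phi + \Sigma_\theta^{-1}\right)^{-1}\left(\Phi^T\Sigma_E^{-1}Y + \Sigma_\theta^{-1}\mu_\theta\right) \tag{4.13}$$
$$\mathrm{Var}(\theta|Y) = \left(\Phi^T\Sigma_E^{-1}\Phi + \Sigma_\theta^{-1}\right)^{-1}. \tag{4.14}$$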
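
As a numerical illustration, the two equivalent forms (4.10)-(4.11) and (4.13)-(4.14) can be compared on simulated data, and the credible intervals (4.8) can be read off the posterior variance. The sketch below uses Python with NumPy; the function names, dimensions and numerical values are arbitrary choices made for this example only.

```python
import numpy as np

def bayes_cov_form(Phi, Y, mu, S_theta, S_E):
    """Posterior mean and variance via (4.10)-(4.11); solves an N x N system."""
    S_Y = Phi @ S_theta @ Phi.T + S_E
    K = np.linalg.solve(S_Y, Phi @ S_theta).T      # S_theta Phi^T S_Y^{-1}
    return mu + K @ (Y - Phi @ mu), S_theta - K @ Phi @ S_theta

def bayes_info_form(Phi, Y, mu, S_theta, S_E):
    """Posterior mean and variance via (4.13)-(4.14); inverts n x n matrices."""
    SE_inv_Phi = np.linalg.solve(S_E, Phi)         # S_E^{-1} Phi
    post_var = np.linalg.inv(Phi.T @ SE_inv_Phi + np.linalg.inv(S_theta))
    rhs = SE_inv_Phi.T @ Y + np.linalg.solve(S_theta, mu)
    return post_var @ rhs, post_var

# Simulated data from the linear Gaussian model (4.9); all values are illustrative.
rng = np.random.default_rng(0)
N, n = 20, 3
Phi = rng.standard_normal((N, n))
mu = np.array([1.0, -0.5, 0.0])
S_theta = np.diag([2.0, 1.0, 0.5])
S_E = 0.1 * np.eye(N)
theta = rng.multivariate_normal(mu, S_theta)
Y = Phi @ theta + rng.multivariate_normal(np.zeros(N), S_E)

m1, V1 = bayes_cov_form(Phi, Y, mu, S_theta, S_E)
m2, V2 = bayes_info_form(Phi, Y, mu, S_theta, S_E)

# 95% credible intervals (4.8) for each entry of theta.
sd = np.sqrt(np.diag(V2))
ci95 = np.column_stack([m2 - 1.96 * sd, m2 + 1.96 * sd])
```

The first form works with an N x N matrix and the second with n x n matrices, which is why the second is typically preferred when n is much smaller than N.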


As already noted, the Bayes estimate coincides with $\theta^{\mathrm{MAP}}$, the maximum of the
posterior density:

$$p(\theta|Y) \propto p(Y|\theta)\,p(\theta).$$


Recall that, in view of the assumed linear model (4.9),

$$Y|\theta \sim \mathcal{N}(\Phi\theta, \Sigma_E)$$

and note that

$$\log p(\theta) = c_1 - \frac{1}{2}(\theta - \mu_\theta)^T\Sigma_\theta^{-1}(\theta - \mu_\theta) \tag{4.15}$$

$$\log p(Y|\theta) = c_2 - \frac{1}{2}(Y - \Phi\theta)^T\Sigma_E^{-1}(Y - \Phi\theta), \tag{4.16}$$

where $c_1$ and $c_2$ are constants we are not concerned with. Therefore, the maximization
of the posterior density can be written as

$$\theta^{\mathrm{MAP}} = \arg\max_\theta\; \log p(Y|\theta) + \log p(\theta)$$

$$\phantom{\theta^{\mathrm{MAP}}} = \arg\min_\theta\; (Y - \Phi\theta)^T\Sigma_E^{-1}(Y - \Phi\theta) + (\theta - \mu_\theta)^T\Sigma_\theta^{-1}(\theta - \mu_\theta)$$

whose solution is easily shown to be given by (4.13). This shows that, under Gaussianity
assumptions, the Bayes estimate of the linear model can be seen as a regularized
least squares estimator with quadratic regularization term (ReLS-Q), see Sect. 3.4.
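In particular, if

$$\Sigma_E = \sigma^2 I_N, \quad \mu_\theta = 0, \tag{4.17}$$

the Bayes and MAP estimators,

$$\theta^{\mathrm{B}} = \theta^{\mathrm{MAP}} = \arg\min_\theta\; \|Y - \Phi\theta\|^2 + \sigma^2\theta^T\Sigma_\theta^{-1}\theta, \tag{4.18}$$

coincide with the ReLS estimator with regularization matrix $P = \Sigma_\theta/\sigma^2$. Under the
further assumption $\Sigma_\theta = \lambda I_n$, the MAP estimator coincides with a ridge regression
estimator with $\gamma = \sigma^2/\lambda$.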
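
The equivalence with ridge regression is easy to check numerically. In the sketch below (Python with NumPy, arbitrary illustrative data) the Bayes estimate is computed from (4.10) with a zero prior mean, prior variance lam * I and noise variance sigma2 * I, and then compared with the ridge solution obtained for gamma = sigma2 / lam.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 30, 4
Phi = rng.standard_normal((N, n))
Y = rng.standard_normal(N)
sigma2, lam = 0.5, 2.0

# Bayes/MAP estimate from (4.10) with mu_theta = 0, Sigma_theta = lam*I, Sigma_E = sigma2*I.
S_Y = lam * Phi @ Phi.T + sigma2 * np.eye(N)
theta_bayes = lam * Phi.T @ np.linalg.solve(S_Y, Y)

# Ridge regression with regularization parameter gamma = sigma2 / lam.
gamma = sigma2 / lam
theta_ridge = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)
```

The two vectors agree up to rounding errors, so a weak prior (large lam) corresponds to a small ridge penalty and vice versa.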


Remark 4.1 When $\Sigma_\theta = P$, where $P = P^T \ge 0$ is singular, one can still use (4.10)
to obtain the Bayes estimate, while (4.13) and the quadratic problem (4.18) are no
longer valid due to the nonexistence of $\Sigma_\theta^{-1}$. Nevertheless, by replicating the derivation
in Remark 3.1, it is still possible to interpret the Bayes estimate as the solution of a
constrained quadratic problem. In particular, under (4.17), we have that

$$\theta^{\mathrm{B}} = \arg\min_\theta\; \|Y - \Phi\theta\|^2 + \sigma^2\theta^T P^{+}\theta \tag{4.19}$$
$$\text{subj. to } U_2^T\theta = 0, \tag{4.20}$$

where $U_2$ was defined in Remark 3.1, as part of the singular value decomposition of
$P$. The result can be interpreted as follows. A singular variance matrix means that
we have perfect knowledge of some linear combination of the parameter vector. In
particular,

$$\mathrm{Var}\left[U_2^T\theta\right] = U_2^T\,\mathrm{Var}(\theta)\,U_2 = U_2^T\,[U_1\; U_2]\begin{bmatrix}\Lambda_P & 0 \\ 0 & 0\end{bmatrix}[U_1\; U_2]^T\,U_2 = 0,$$

where $\Lambda_P$ denotes the diagonal block of nonzero singular values and, with reference
to the SVD of $P$, we have exploited the fact that $U_2^T U_1 = 0$.
As a consequence,

$$\Pr\left(U_2^T\theta = U_2^T\mu_\theta\right) = 1,$$

thus justifying the presence of the equality constraints in the quadratic problem
(4.19)-(4.20), where $\mu_\theta = 0$ is assumed. Recalling the orthogonality of $U_1$ and $U_2$,
we have that $U_2^T\theta = 0$ implies that $\theta \in \mathrm{Range}(U_1) = \mathrm{Range}(P)$. Therefore, the
constrained quadratic problem (4.19)-(4.20) can also be equivalently reformulated as

$$\theta^{\mathrm{B}} = \arg\min_{\theta \in \mathrm{Range}(P)}\; \|Y - \Phi\theta\|^2 + \sigma^2\theta^T P^{+}\theta. \tag{4.21}$$

One can also verify that the solution of this problem can be written as

$$\theta^{\mathrm{B}} = P\Phi^T\left(\Phi P\Phi^T + \sigma^2 I_N\right)^{-1}Y,$$

an expression which does not require the invertibility of $P$.
In conclusion, the Bayes estimate always exists and is unique. In any case, it can
be written as (4.7) with $\Sigma_Y^{-1}$ replaced by its pseudoinverse.
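
A small numerical experiment may help to visualize the remark. In the following sketch (Python with NumPy) a rank-deficient prior covariance P is built by hand, the estimate is computed without inverting P, and it is compared with the solution of the problem restricted to Range(P). The matrices `U1`, `u2` and the diagonal values are illustrative assumptions, chosen so that the decomposition of P is known by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 15, 3
Phi = rng.standard_normal((N, n))
Y = rng.standard_normal(N)
sigma2 = 0.2

# Rank-2 prior covariance: zero prior variance along the direction u2.
U1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
u2 = np.array([0.0, 0.0, 1.0])
Lam = np.diag([2.0, 0.5])
P = U1 @ Lam @ U1.T

# Bayes estimate from (4.10) with mu_theta = 0: no inverse of P is needed.
theta_b = P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + sigma2 * np.eye(N), Y)

# Same estimate from the restricted problem (4.21): theta = U1 c, penalty c^T Lam^{-1} c.
A = Phi @ U1
c = np.linalg.solve(A.T @ A + sigma2 * np.linalg.inv(Lam), A.T @ Y)
theta_restricted = U1 @ c
```

The component of the estimate along `u2` is zero, in agreement with the equality constraint (4.20).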


The Bayesian interpretation of deterministic regularization can be exploited to
obtain a guideline for the selection of the regularization matrix. The simplest case is
when some statistics, e.g., based on samples coming from past problems, are available
for the parameter vector $\theta$. Then, the Bayesian interpretation suggests selecting the
covariance matrix of $\theta$, divided by the error variance $\sigma^2$, as regularization matrix. If
examples from the past are not available, one may rely on prior knowledge, telling
that some entries of $\theta$ have smaller variance than others or that some correlation
exists between the entries.


4.2.4 Hierarchical Bayes: Hyperparameters 


In the cases in which prior information on the parameters is not sufficient to specify
a prior, it is common to resort to hierarchical Bayesian models. Instead of fixing the
prior, a family of priors is considered, parametrized by one or more hyperparameters.
As an example, consider the case in which prior knowledge could be formalized in
terms of zero-mean independent and identically distributed parameters whose absolute
value is not too large. In the absence of more precise information on their size, we could
adopt the following prior:

$$\theta \sim \mathcal{N}(0, \lambda I_n),$$

where the scalar $\lambda$, called hyperparameter, enters the game as a further unknown
quantity. More in general, the prior distribution $p(\theta|\alpha)$ may depend on a
hyperparameter vector $\alpha$. One may also want to consider a hyperparameter vector $\beta$ entering
the definition of the likelihood $p(Y|\theta, \beta)$. The most common example is when the
measurement variance $\sigma^2$ is not known and is therefore treated as a hyperparameter.

In the following, the vector of all hyperparameters will be denoted by

$$\eta = \left[\alpha^T\; \beta^T\right]^T.$$

For a given $\eta$, we will denote by $\theta^{\mathrm{MAP}}(\eta)$ and $\theta^{\mathrm{B}}(\eta)$ the corresponding MAP and
Bayes estimates:

$$\theta^{\mathrm{MAP}}(\eta) = \arg\max_\theta\; p(\theta|Y, \eta) \tag{4.22}$$

$$\theta^{\mathrm{B}}(\eta) = \mathcal{E}(\theta|Y, \eta) = \int \theta\, p(\theta|Y, \eta)\,d\theta, \tag{4.23}$$

where

$$p(\theta|Y, \eta) = \frac{p(Y|\theta, \beta)\,p(\theta|\alpha)}{\int p(Y|\theta, \beta)\,p(\theta|\alpha)\,d\theta}. \tag{4.24}$$


4.3 Bayesian Interpretation of the James-Stein Estimator


In this section, we show that the James-Stein estimator can be seen as a particular
Bayesian estimator. As seen in Eq. (1.2), the measurement model is

$$Y = \theta + E, \quad E \sim \mathcal{N}(0, \sigma^2 I_N). \tag{4.25}$$

In a Bayesian setting, the parameter vector is regarded as a random vector, whose
distribution reflects our state of knowledge. In particular, we assume

$$\theta \sim \mathcal{N}(0, \lambda I_N), \tag{4.26}$$

where $\lambda$ plays the role of hyperparameter. It follows that $\theta$ and $Y$ are zero-mean
jointly Gaussian variables with

$$\Sigma_{\theta Y} = \mathcal{E}(\theta Y^T) = \mathcal{E}(\theta\theta^T) = \lambda I_N, \quad \Sigma_Y = \mathcal{E}(YY^T) = (\lambda + \sigma^2)I_N. \tag{4.27}$$

According to (4.7), the Bayes estimate is given by the conditional expectation

$$\mathcal{E}(\theta|Y) = \Sigma_{\theta Y}\Sigma_Y^{-1}Y = \frac{\lambda}{\lambda + \sigma^2}\,Y = (1 - r_{\mathrm{Bayes}})\,Y, \tag{4.28}$$

where

$$r_{\mathrm{Bayes}} = \frac{\sigma^2}{\lambda + \sigma^2}. \tag{4.29}$$

It is apparent that the estimator (4.28) has the same structure as the James-Stein one,
with $r$ replaced by $r_{\mathrm{Bayes}}$.


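Since $Y$ and $\theta$ are jointly Gaussian, $\mathcal{E}(\theta|Y) = \theta^{\mathrm{MAP}}$, where

$$\theta^{\mathrm{MAP}} = \arg\min_\theta\; \frac{\|Y - \theta\|^2}{\sigma^2} + \frac{\|\theta\|^2}{\lambda}$$

$$\phantom{\theta^{\mathrm{MAP}}} = \arg\min_\theta\; \|Y - \theta\|^2 + \frac{\sigma^2}{\lambda}\|\theta\|^2$$

which highlights the fact that $\mathcal{E}(\theta|Y)$ is the solution of a regularized least squares
problem, controlled by the regularization parameter $\sigma^2/\lambda$.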

If the variances $\lambda$ and $\sigma^2$ could be assigned on the basis of prior knowledge, the
similarity would be only formal. Let us make a step forward, considering the case in
which the variance $\sigma^2$ is given, while $\lambda$ is estimated from the data. The basic idea is
that the hyperparameter $\lambda$ could be tuned based on the observed vector $Y$ and plugged
into (4.29) to obtain an estimate of $r_{\mathrm{Bayes}}$. Alternatively, one may focus directly on
finding a sensible estimate of $r_{\mathrm{Bayes}}$. In this respect, we are going to show that Stein's
$r$ is an unbiased estimate of $r_{\mathrm{Bayes}}$ under the Gaussian model (4.25) and (4.26) [6].
For this purpose, we will exploit a property of the inverse chi-square variable.


Definition 4.1 (chi-square random variable) The sum of the squares of $n$ standard
Gaussian independent random variables is a nonnegative valued random variable
known as chi-square variable with $n$ degrees of freedom:

$$\chi_n^2 = \sum_{i=1}^{n} X_i^2, \quad X_i \sim \mathcal{N}(0, 1).$$

Its mean and variance are

$$\mathcal{E}(\chi_n^2) = n, \quad \mathrm{Var}(\chi_n^2) = 2n.$$

The inverse of a chi-square variable is called inverse chi-square. For $n > 2$, its mean
is

$$\mathcal{E}\left[\frac{1}{\chi_n^2}\right] = \frac{1}{n-2}. \tag{4.30}$$

Now, assume $N > 2$ and observe that

$$\frac{\|Y\|^2}{\lambda + \sigma^2} = \sum_{i=1}^{N}\frac{Y_i^2}{\lambda + \sigma^2} \sim \chi_N^2.$$

Recalling that the expectation of the inverse chi-square is equal to $1/(N-2)$, we
have that

$$\mathcal{E}\left[\frac{\lambda + \sigma^2}{\|Y\|^2}\right] = \mathcal{E}\left[\frac{1}{\chi_N^2}\right] = \frac{1}{N-2}.$$

Therefore,

$$\mathcal{E}[r] = \mathcal{E}\left[\frac{(N-2)\sigma^2}{\|Y\|^2}\right] = \frac{\sigma^2}{\lambda + \sigma^2} = r_{\mathrm{Bayes}}.$$

This means that James-Stein's shrinking coefficient $r$ can be seen as an unbiased
estimator of the shrinking coefficient $r_{\mathrm{Bayes}}$ appearing in the formula of the posterior
expectation.
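
The unbiasedness result can be checked by simulation. The following Monte Carlo sketch (Python with NumPy) draws many realizations of Y from its marginal distribution under (4.25)-(4.26) and averages the James-Stein coefficient; the values of N, lam, sigma2 and the number of trials are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, lam, sigma2 = 10, 2.0, 1.0
n_trials = 200_000

# Under (4.25)-(4.26), marginally Y ~ N(0, (lam + sigma2) I_N).
Y = np.sqrt(lam + sigma2) * rng.standard_normal((n_trials, N))

# James-Stein shrinking coefficient r = (N - 2) sigma2 / ||Y||^2 for each realization.
r_js = (N - 2) * sigma2 / np.sum(Y**2, axis=1)

r_bayes = sigma2 / (lam + sigma2)   # shrinking coefficient (4.29)
r_mean = r_js.mean()                # Monte Carlo estimate of E[r]
```

With these settings `r_bayes` equals 1/3 and the sample average `r_mean` lies close to it, although each individual realization of `r_js` fluctuates appreciably around that value.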

The example is instructive in several respects. First, it shows that, under suitable
probabilistic assumptions, the typical structure of regularized estimators can be
justified through Bayesian arguments. The second point has to do with the tuning of
the regularization parameters. In the empirical Bayes approach, see Sect. 4.4, there
is a preliminary step in which a point estimate of hyperparameters is obtained by
standard estimation methods. Then, this point estimate is plugged into the expression
of the Bayesian estimator. Although a full Bayesian approach would call for the
joint estimation of parameters and hyperparameters, the two-step empirical Bayes
approach not only combines simplicity and effectiveness but also provides a probabilistic
underpinning to regularized identification methods.


4.4 Full and Empirical Bayes Approaches 


When the prior, and possibly the likelihood, include hyperparameters, Bayesian
estimation becomes more complex and gives rise to alternative approaches. In principle,
we want to obtain the posterior distribution

$$p(\theta|Y) = \frac{p(Y|\theta)\,p(\theta)}{p(Y)}.$$

However, if a hierarchical Bayesian model is adopted, we do not know $p(\theta)$, but only
$p(\theta|\eta)$. At the cost of assigning a prior $p(\eta)$ also to the hyperparameters, the prior
$p(\theta)$ can be obtained by marginalization of the joint probability density:

$$p(\theta) = \int p(\theta, \eta)\,d\eta = \int p(\theta|\eta)\,p(\eta)\,d\eta.$$

In general, this integral has to be computed numerically, e.g., by Monte Carlo
methods. This leads to full Bayesian methods that compute the desired $p(\theta|Y)$ regarding
both parameters and hyperparameters as random variables. Some remarks on these
methods will be given in Sect. 4.10.

The justification for a simpler computational scheme stems from the following
reformulation of the posterior:

$$p(\theta|Y) = \int p(\theta, \eta|Y)\,d\eta = \int p(\theta|\eta, Y)\,p(\eta|Y)\,d\eta. \tag{4.31}$$

Observe that

$$p(\eta|Y) \propto p(Y|\eta)\,p(\eta), \tag{4.32}$$

where $L(\eta|Y) = p(Y|\eta)$ is the likelihood of the hyperparameter vector $\eta$. It is also
called marginal likelihood because it is obtained from the marginalization with
respect to $\theta$ of the joint density $p(Y, \theta|\eta)$:

$$L(\eta|Y) = \int p(Y, \theta|\eta)\,d\theta = \int p(Y|\theta, \eta)\,p(\theta|\eta)\,d\theta. \tag{4.33}$$

If data are sufficiently informative, the marginal likelihood has good chances to
be unimodal and sharply peaked in a neighbourhood of the maximum likelihood
estimate

$$\eta^{\mathrm{ML}} = \arg\max_\eta\; p(Y|\eta).$$

When this happens and $p(\eta)$ is rather uninformative (as it should be), from (4.32) it
follows that $p(\eta|Y)$ is peaked as well. Then, as long as the properties of $p(\theta|\eta, Y)$
do not change rapidly with $\eta$ near $\eta^{\mathrm{ML}}$, the integral (4.31) can be approximated as

$$p(\theta|Y) \approx p(\theta|\eta^{\mathrm{ML}}, Y) = \frac{p(Y|\theta, \eta^{\mathrm{ML}})\,p(\theta|\eta^{\mathrm{ML}})}{p(Y|\eta^{\mathrm{ML}})}.$$

In practice, this suggests computing the posterior using the prior $p^*(\theta) = p(\theta|\eta^{\mathrm{ML}})$
associated with the maximum likelihood estimate of hyperparameters. More in general,
Empirical Bayes (EB) methods adopt a two-stage scheme. In the first step, a
point estimate $\eta^*$ is computed which is then kept fixed in the second step, when the
posterior of the parameters is obtained, based on the prior $p^*(\theta) = p(\theta|\eta^*)$.

Among the advantages of the approach one may mention its simplicity, especially
when there are few hyperparameters and the posterior $p(\theta|Y, \eta^{\mathrm{ML}})$ is easily obtained
as in the jointly Gaussian case. Moreover, the tuning of $\eta$ admits an intuitive
interpretation as the counterpart of model order selection in classic parametric estimation
methods. The main drawback is that the EB method fails to propagate the uncertainty
of the point estimate $\eta^*$.

Under the linear Gaussian model (4.9), the integral (4.33) admits a closed-form
solution. In fact, since

$$Y \sim \mathcal{N}\left(\Phi\mu_\theta(\eta), \Sigma_Y(\eta)\right), \quad \Sigma_Y(\eta) = \Phi\Sigma_\theta(\eta)\Phi^T + \Sigma_E(\eta),$$

we have

$$\log L(\eta|Y) = -\frac{1}{2}\log\det(2\pi\Sigma_Y) - \frac{1}{2}(Y - \Phi\mu_\theta)^T\Sigma_Y^{-1}(Y - \Phi\mu_\theta), \tag{4.34}$$

where in the right-hand side dependence on $\eta$ has been omitted for simplicity.
Therefore, application of Empirical Bayes estimation to the linear model (4.9)
would consist of the following two steps:

Step 1:

$$\eta^* = \eta^{\mathrm{ML}} = \arg\max_\eta\; L(\eta|Y).$$

Step 2: Let $\mu_\theta = \mu_\theta(\eta^*)$, $\Sigma_E = \Sigma_E(\eta^*)$, $\Sigma_\theta = \Sigma_\theta(\eta^*)$ and compute the posterior
expectation according to Sect. 4.2.3.
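
The two steps above can be sketched as follows for the simplest hierarchical prior, a zero-mean prior with covariance lam * I and known noise variance. This is only an illustration in Python with NumPy: the function names are hypothetical, and the maximization in Step 1 is carried out by a plain grid search, which is an assumption made here for simplicity and stands in for any numerical optimizer.

```python
import numpy as np

def log_marginal_likelihood(Y, Phi, mu, S_theta, S_E):
    """Log marginal likelihood (4.34) of the linear Gaussian model."""
    S_Y = Phi @ S_theta @ Phi.T + S_E
    resid = Y - Phi @ mu
    _, logdet = np.linalg.slogdet(2.0 * np.pi * S_Y)
    return -0.5 * logdet - 0.5 * resid @ np.linalg.solve(S_Y, resid)

def empirical_bayes_ridge(Y, Phi, sigma2, lam_grid):
    """Two-step EB for the prior theta ~ N(0, lam*I) with known noise variance."""
    N, n = Phi.shape
    mu, S_E = np.zeros(n), sigma2 * np.eye(N)
    # Step 1: maximize the marginal likelihood over the candidate values of lam.
    scores = [log_marginal_likelihood(Y, Phi, mu, lam * np.eye(n), S_E)
              for lam in lam_grid]
    lam_star = lam_grid[int(np.argmax(scores))]
    # Step 2: posterior mean (4.10) with the hyperparameter fixed at lam_star.
    S_Y = lam_star * Phi @ Phi.T + S_E
    theta_hat = lam_star * Phi.T @ np.linalg.solve(S_Y, Y)
    return lam_star, theta_hat

# Example with Phi = I, where the maximizer is known in closed form:
# Y ~ N(0, (lam + sigma2) I), hence lam_ML = max(0, ||Y||^2 / N - sigma2).
rng = np.random.default_rng(6)
N, sigma2 = 100, 1.0
Phi = np.eye(N)
Y = 2.0 * rng.standard_normal(N)            # marginal variance lam + sigma2 = 4
lam_grid = np.linspace(0.02, 10.0, 500)     # grid spacing 0.02
lam_star, theta_hat = empirical_bayes_ridge(Y, Phi, sigma2, lam_grid)
lam_closed_form = max(0.0, Y @ Y / N - sigma2)
```

In this special case the grid maximizer can be compared with the closed-form value, and the resulting estimate is the shrinkage rule `lam_star / (lam_star + sigma2) * Y`, in line with Sect. 4.3.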

When the likelihood and the prior are such that integral (4.33) cannot be computed
explicitly, an approximation is needed. In particular, one can resort to the
Laplace approximation, which is based on a second-order Taylor expansion of
$\log p(Y, \theta|\eta)$ around $\theta^{\mathrm{MAP}}(\eta)$ defined in (4.22), from which an integrable
approximation of $p(Y, \theta|\eta)$ appearing in (4.33) is obtained. Note, however, that the Laplace
approximation has to be recalculated for each evaluation of $L(\eta|Y)$ occurring during
the iterative computation of $\eta^{\mathrm{ML}}$.
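
For reference, the approximation just described can be written out explicitly. The formulas below are the standard form of the Laplace approximation rather than expressions quoted from this chapter; the symbol $H(\eta)$ for the negative Hessian is introduced here only for illustration, and $n$ is the dimension of $\theta$. Since $\theta^{\mathrm{MAP}}(\eta)$ maximizes $p(Y, \theta|\eta)$ with respect to $\theta$, the first-order term of the expansion vanishes:

```latex
\log p(Y,\theta|\eta) \approx \log p\big(Y,\theta^{\mathrm{MAP}}(\eta)\,|\,\eta\big)
  - \frac{1}{2}\big(\theta-\theta^{\mathrm{MAP}}(\eta)\big)^T H(\eta)\,
    \big(\theta-\theta^{\mathrm{MAP}}(\eta)\big),
\qquad
H(\eta) = -\left.\frac{\partial^2 \log p(Y,\theta|\eta)}{\partial\theta\,\partial\theta^T}
  \right|_{\theta=\theta^{\mathrm{MAP}}(\eta)}

L(\eta|Y) = \int p(Y,\theta|\eta)\,d\theta
  \approx p\big(Y,\theta^{\mathrm{MAP}}(\eta)\,|\,\eta\big)\,(2\pi)^{n/2}\,\det\big(H(\eta)\big)^{-1/2}
```

The second line follows from integrating the Gaussian-shaped approximation in closed form. Under the linear Gaussian model (4.9) the expansion is exact, so the approximation reproduces the marginal likelihood (4.34).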


4.5 Improper Priors and the Bias Space 


The use of priors is most useful whenever the data alone are not sufficient to provide 
reliable parameter estimates but there exists some a priori knowledge that can be 
exploited. It may happen that for some parameters the introduction of a prior is not 
possible or not desirable, because their estimation can be satisfactorily performed 
anyway, given the information in the data. This can be accounted for by assuming 
that such parameters have improper priors. 

In order to deal with the case where $p$ parameters $\theta^P \in \mathbb{R}^p$ have a proper prior
and the remaining $n-p$ parameters $\theta^I \in \mathbb{R}^{n-p}$ have an improper prior, consider the
following model:

$$Y = \Phi\theta + E, \quad \Phi = [\Omega\;\; \Psi], \quad \theta = \begin{bmatrix}\theta^P \\ \theta^I\end{bmatrix} \tag{4.35}$$

$$\theta \sim \mathcal{N}(0, \Sigma_\theta), \quad E \sim \mathcal{N}(0, \sigma^2 I_N) \tag{4.36}$$

$$\Sigma_\theta = \begin{bmatrix}\Sigma & 0 \\ 0 & aI_{n-p}\end{bmatrix}, \quad \Sigma > 0. \tag{4.37}$$

The (asymptotically) improper prior for $\theta^I$ is obtained by letting $a \to \infty$ so that $\theta^I$
has infinite variance, i.e., its density is flat. This amounts to complete lack of prior
knowledge for the last $n-p$ entries of the parameter vector $\theta$ that, for simplicity, is
assumed to be zero mean. The use of improper priors in a Bayesian setting has the
same effect as the introduction of a bias space in a deterministic regularization setting.
Within such a subspace, parameters are immune from regularization, a feature that
could be useful to apply regularization only where needed without causing undesired
distortions. The following theorem, whose proof is in Sect. 4.13.4, is analogous to
a result obtained in [22] to obtain a Bayesian interpretation of smoothing splines. It
illustrates the asymptotic behaviour of posterior means and variances as $a$ goes to
infinity.


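Theorem 4.3 (adapted from [22]) If $\mathrm{rank}(\Phi) = n$ and $\mathrm{rank}(\Psi) = n - p$, then

$$\lim_{a\to\infty}\mathcal{E}(\theta^I|Y) = \left(\Psi^T M^{-1}\Psi\right)^{-1}\Psi^T M^{-1}Y$$

$$\lim_{a\to\infty}\mathcal{E}(\theta^P|Y) = \Sigma\,\Omega^T M^{-1}\left(I_N - \Psi\left(\Psi^T M^{-1}\Psi\right)^{-1}\Psi^T M^{-1}\right)Y$$

$$M = \Omega\Sigma\Omega^T + \sigma^2 I_N$$

$$\lim_{a\to\infty}\mathrm{Var}(\theta|Y) = \sigma^2\left(\Phi^T\Phi + \sigma^2\begin{bmatrix}\Sigma^{-1} & 0 \\ 0 & 0\end{bmatrix}\right)^{-1}.$$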
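
The limits in Theorem 4.3 can be checked numerically by comparing them with the posterior mean obtained for a very large, but finite, value of a. The sketch below is written in Python with NumPy; dimensions, matrices and data are arbitrary illustrative choices, and the variable `q` stands for n - p.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, q = 25, 3, 2                 # q = n - p entries with a (nearly) improper prior
Omega = rng.standard_normal((N, p))
Psi = rng.standard_normal((N, q))
Phi = np.hstack([Omega, Psi])
Sigma = np.diag([1.0, 0.5, 2.0])
sigma2 = 0.3
Y = rng.standard_normal(N)

# Posterior mean for large finite a, using the information form (4.13) with mu_theta = 0.
a = 1e10
prior_info = np.zeros((p + q, p + q))
prior_info[:p, :p] = np.linalg.inv(Sigma)
prior_info[p:, p:] = np.eye(q) / a
theta_post = np.linalg.solve(Phi.T @ Phi / sigma2 + prior_info, Phi.T @ Y / sigma2)

# Limits stated in Theorem 4.3.
M = Omega @ Sigma @ Omega.T + sigma2 * np.eye(N)
Minv_Psi = np.linalg.solve(M, Psi)
theta_I = np.linalg.solve(Psi.T @ Minv_Psi, Minv_Psi.T @ Y)
theta_P = Sigma @ Omega.T @ np.linalg.solve(M, Y - Psi @ theta_I)

# Limit of the posterior variance.
limit_info = np.zeros((p + q, p + q))
limit_info[:p, :p] = np.linalg.inv(Sigma)
var_limit = sigma2 * np.linalg.inv(Phi.T @ Phi + sigma2 * limit_info)
```

Note that the estimate of the entries with improper prior is a generalized least squares fit of Y on `Psi` with weight matrix `M` inverse, while the remaining entries are estimated from the residual `Y - Psi @ theta_I`.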


An interesting benefit of improper priors is the possibility of reducing the number
of hyperparameters by treating some of them as unknowns whose prior is improper.
Letting the symbol $1_{n\times 1}$ denote a column vector of ones, assume, for example,
that $\theta \sim \mathcal{N}(\mu 1_{n\times 1}, \Sigma_\theta)$, i.e., all the scalar entries of $\theta$ share the same prior mean
$\mu$. In most cases, very little is known about $\mu$, which could therefore be regarded
as a hyperparameter to be tuned by marginal likelihood maximization. It can be
then treated as a deterministically known variable, according to the Empirical Bayes
approach, see Sect. 4.4. By this choice, however, the hyperparameter is fixed to its
point estimate and its uncertainty is not propagated, implying that the uncertainty of
$\theta^{\mathrm{B}}$ will be underestimated if assessed by (4.14).

Alternatively, $\mu$ can be treated as a further random parameter. For this purpose,
define $\tilde\theta = \theta - \mu 1_{n\times 1}$ and consider the model

$$Y = \bar\Phi\bar\theta + E, \quad \bar\Phi = [\Phi\;\; \Phi 1_{n\times 1}], \quad \bar\theta = \begin{bmatrix}\tilde\theta \\ \mu\end{bmatrix}$$

$$\bar\theta \sim \mathcal{N}(0, \Sigma_{\bar\theta}), \quad \Sigma_{\bar\theta} = \begin{bmatrix}\Sigma_\theta & 0 \\ 0 & a\end{bmatrix}, \quad E \sim \mathcal{N}(0, \sigma^2 I_N).$$

This formulation decreases the number of hyperparameters, without introducing
prejudices (provided we let $a \to \infty$). More importantly, it is now possible to assess
the joint uncertainty of the estimates of $\mu$ and $\tilde\theta$ through the posterior variance
$\mathrm{Var}(\bar\theta|Y)$.


4.6 Maximum Entropy Priors 


A major appeal of the Bayesian paradigm lies in its ability to provide a rational 
foundation to regularization: one starts from prior knowledge and then proceeds 
with its formalization in terms of a probabilistic prior, from which the regularization
penalty is finally derived. However, there is a stumbling block in the way, because
the available prior knowledge is often too vague to avoid arbitrariness in the choice 
of the prior distribution. As a matter of fact, the derivation of systematic approaches 
for the selection of prior distributions is a classic topic of Bayesian estimation theory. 
In this section, the approach based on entropy maximization is briefly reviewed. 

The starting point is the observation that, even when prior information is absent
or very limited, there are candidate distributions that are obviously preferable, due to
symmetry arguments. Assume, for instance, that candidate values for a scalar parameter
$\theta$ are known to belong to a finite set $\{\theta_i,\ i = 1, \ldots, m\}$ and no further information
is available. Then, the only reasonable prior distribution will be $p(\theta = \theta_i) = 1/m$. In
fact, assigning unequal probabilities would create an unjustified asymmetry, given
that our prior information does not make any distinction between the $m$ possible
values of the parameter.

The case of a continuous-valued parameter $\theta$ taking values in a finite interval
$[a, b]$ can be addressed in a similar way. In this case, a reasonable prior distribution
is the uniform one:

$$p(\theta) = \begin{cases}\dfrac{1}{b-a}, & a \le \theta \le b \\ 0, & \text{elsewhere.}\end{cases}$$

In both examples, we might say the chosen distributions are those that reflect the
maximum ignorance about the unknown parameter.

The next step is to formalize this notion of maximum ignorance in contexts where
some partial information about $\theta$ is available. This can be done by means of the
notion of entropy of a probability distribution. For a discrete distribution $p(\cdot)$ taking
values $p(\theta_i)$ on a countable set $\{\theta_i\}$, the entropy $H$ is defined as

$$H(p) = -\sum_i p(\theta_i)\log p(\theta_i).$$

Note that the minimum possible entropy $H(p) = 0$ occurs when the probability
is concentrated at a unique value $\bar\theta$. This is the case of a maximally informative
distribution such that $p(\theta = \bar\theta) = 1$. Conversely, if the set $\{\theta_i\}$ has cardinality $m$,
the maximum value $H(p) = \log(m)$ is achieved in correspondence of the uniform
distribution $p(\theta = \theta_i) = 1/m$, $\forall i$. In other words, the larger the entropy, the less
information is conveyed by the distribution.
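
A few lines of code are enough to see these extreme cases at work. The sketch below (Python with NumPy) evaluates the entropy of three distributions on a set of four values; the function name `entropy` and the example probabilities are illustrative assumptions, and the natural logarithm is used.

```python
import numpy as np

def entropy(p):
    """Entropy H(p) = -sum_i p_i log p_i of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                       # usual convention: 0 * log 0 = 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

m = 4
h_point = entropy([1.0, 0.0, 0.0, 0.0])      # all the mass on a single value
h_skewed = entropy([0.7, 0.1, 0.1, 0.1])     # intermediate case
h_uniform = entropy(np.full(m, 1.0 / m))     # maximum ignorance
```

The three values are ordered as expected: zero for the point mass, log(4) for the uniform distribution, and something in between for the skewed one.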

For continuous-valued random variables, the notion of differential entropy $h(p)$
is introduced:

$$h(p) = -\int_{D_\theta} p(\theta)\log p(\theta)\,d\theta,$$

where $D_\theta$ denotes the support of the distribution. Note that, among distributions
supported on a given finite interval, the maximum possible (differential) entropy is
achieved by the uniform distribution.

The principle of Maximum Entropy (MaxEnt) states that the admissible distribution
with largest entropy is the one that best represents the current state of knowledge.
The admissible distributions are those that satisfy a set of constraints, chosen so as
to incorporate all the available prior knowledge. For instance, if the prior knowledge
amounts to knowing that $\theta \in [a, b]$, the prior suggested by the MaxEnt principle is
the uniform distribution. Other types of constraints are typically expressed as
expectations of functions of the parameters $\theta$. In particular, consider a random variable $\theta$,
subject to known values $\eta_i$ of $m$ expectations

$$\mathcal{E}[g_i(\theta)] = \int g_i(\theta)\,p(\theta)\,d\theta = \eta_i, \quad i = 1, \ldots, m. \tag{4.38}$$

Then, we have the following useful result.


Theorem 4.4 (General form of maximum entropy distributions, based on [12])
Among all the distributions satisfying (4.38), the maximum entropy one is of
exponential type

$$p(\theta) = A\exp\left(-\lambda_1 g_1(\theta) - \ldots - \lambda_m g_m(\theta)\right), \tag{4.39}$$

where $\lambda_i$ are $m$ constants determined from (4.38) and $A$ is such that

$$A\int_{-\infty}^{+\infty}\exp\left(-\lambda_1 g_1(\theta) - \ldots - \lambda_m g_m(\theta)\right)d\theta = 1. \tag{4.40}$$

Example 4.5 (MaxEnt prior from information on expected absolute value) Assume
that prior knowledge is summarized by the expectation $\mathcal{E}|\theta| = \eta$. Then, the MaxEnt
prior is the solution of the constrained optimization problem

$$\max_p\; h(p) \quad \text{s.t.} \quad \mathcal{E}|\theta| = \eta.$$

Obviously, $m = 1$ and $g_1(\theta) = |\theta|$. In view of (4.39) and (4.40), $p(\theta)$ is a Laplace
distribution:

$$p(\theta) = 0.5\,\lambda e^{-\lambda|\theta|}.$$

The value of $\lambda$ is found by imposing the constraint on the expectation:

$$\int_{-\infty}^{+\infty} 0.5\,|\theta|\,\lambda e^{-\lambda|\theta|}\,d\theta = \eta.$$

Since the constraint on the expectation is satisfied for $\lambda = 1/\eta$, the following Laplace
distribution is eventually obtained:

$$p(\theta) = \frac{1}{2\eta}\,e^{-|\theta|/\eta}.$$
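
For completeness, the two integrals involved in the example can be worked out explicitly. Both follow from the symmetry of the integrand and from elementary integrals of the exponential function; they are spelled out here as a check and are not quoted from the text.

```latex
A \int_{-\infty}^{+\infty} e^{-\lambda|\theta|}\,d\theta
  = 2A \int_{0}^{+\infty} e^{-\lambda\theta}\,d\theta
  = \frac{2A}{\lambda} = 1
  \quad\Longrightarrow\quad A = \frac{\lambda}{2}

\int_{-\infty}^{+\infty} \frac{\lambda}{2}\,|\theta|\,e^{-\lambda|\theta|}\,d\theta
  = \lambda \int_{0}^{+\infty} \theta\,e^{-\lambda\theta}\,d\theta
  = \lambda \cdot \frac{1}{\lambda^{2}}
  = \frac{1}{\lambda} = \eta
  \quad\Longrightarrow\quad \lambda = \frac{1}{\eta}
```

The first line gives the normalization constant required by (4.40), the second one the value of the multiplier required by (4.38).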


Therefore, starting from very partial information, such as a guess on the expected
absolute value of the parameter, it is possible to completely specify a prior distribution
that: (i) is coherent with the prior knowledge and (ii) does not introduce undue
assumptions because it is the least informative one, so far as entropy is taken as a
measure of informativeness. One could object that it is scarcely realistic to assume
prior knowledge of the expected absolute value of $\theta$. However, if we adopt the empirical
Bayes framework, the objection is circumvented by the possibility of treating $\eta$
as a hyperparameter that will be estimated from data.

Therefore, prior knowledge may just tell that the expectation of $|\theta|$ is finite,
without specifying a value for this expectation. The MaxEnt principle then suggests
the functional form of the prior that incorporates a hyperparameter $\eta$, whose tuning,
e.g., by marginal likelihood maximization, see Sect. 4.4, will be the first step of the
actual estimation algorithm. As will be seen in the following, this particular prior
is associated with the Bayesian interpretation of the regularization penalty employed
by the so-called Lasso estimator that has been already introduced in a deterministic
regularization setting in Sect. 3.6.1.1.


For our purposes, of particular interest are MaxEnt priors satisfying constraints
on the second-order moments. In the scalar case, we have the following classical
result, e.g., see [19].


Proposition 4.1 (based on [12]) Let $\theta$ be a zero-mean random variable with known
variance $\mathcal{E}\theta^2 = \lambda$. Then, the MaxEnt distribution is normal:

$$\theta \sim \mathcal{N}(0, \lambda).$$


Also in this case, the necessity of specifying $\lambda$ is not an issue, because the unknown
variance can be regarded as a hyperparameter and tuned by marginal likelihood
maximization. In other words, if the only prior knowledge is that $\theta$ has a finite,
yet unknown, variance, the MaxEnt principle suggests the use of a normal prior
parametrized by its variance.

When $\theta$ is a vector, a multivariate prior might be derived according to the following
proposition.


Proposition 4.2 (based on [12]) Let $\theta$ be a zero-mean $n$-dimensional random vector
whose entries have known variances $\mathcal{E}\theta_i^2 = \lambda_i$, $i = 1, \ldots, n$. Then, the MaxEnt
distribution is a multivariate normal with diagonal covariance matrix:

$$\theta \sim \mathcal{N}(0, \Sigma_\theta), \quad \Sigma_\theta = \mathrm{diag}\{\lambda_i\}.$$


The importance of this result is twofold. First, also in the multivariate case, the
least informative distribution under second moment constraints is of normal type.
Moreover, if the covariances are unknown, it is seen that the MaxEnt principle yields
independent distributions.

A shortcoming of the maximum entropy approach is that the resulting distributions
are not invariant with respect to reparametrizations of the unknown vector. As
an example, we have already seen that the maximum entropy distribution of $\theta$ in a
finite interval $[1, 2]$ is uniform. On the other hand, if the reparametrization $\gamma = 1/\theta$
is adopted and the MaxEnt approach is applied to $\gamma$, the resulting prior will be a
uniform distribution for $\gamma$ in $[0.5, 1]$, which corresponds to

$$p(\theta) = \begin{cases}\dfrac{2}{\theta^2}, & 1 \le \theta \le 2 \\ 0, & \text{elsewhere,}\end{cases}$$

which is obviously different from a uniform distribution. A possible way to limit
arbitrariness is to specify that, before applying the MaxEnt principle, one should
first identify the “object of interest”. Indeed, choosing either $\theta$ or $1/\theta$ as object of
interest is going to yield different MaxEnt priors.
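
The lack of invariance can also be seen by simulation. In the sketch below (Python with NumPy) samples of gamma are drawn uniformly in [0.5, 1] and transformed into theta = 1/gamma; the empirical distribution of theta is then compared with the one implied by the density 2/theta^2 and with the one of a uniform variable on [1, 2]. The sample size and the test point `t` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
gamma = rng.uniform(0.5, 1.0, size=200_000)   # MaxEnt prior on gamma = 1/theta
theta = 1.0 / gamma

t = 1.5
cdf_empirical = np.mean(theta <= t)            # fraction of samples below t
cdf_from_density = 2.0 * (1.0 - 1.0 / t)       # integral of 2/theta^2 over [1, t]
cdf_if_uniform = (t - 1.0) / (2.0 - 1.0)       # value under a uniform prior on [1, 2]
```

The empirical value is close to 2/3, as predicted by the density 2/theta^2, and clearly different from the value 0.5 that a uniform prior on theta would give.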


4.7 Model Approximation via Optimal Projection x 


Approximate low-order models are commonly used even when there is awareness that 
the real data are generated by a more complex model. Motivations may range from 
their use for control design purposes to better interpretability of the phenomena under 
investigation. Unfortunately, under model misspecification, several nice properties
enjoyed by standard estimators are no longer valid. In particular, a naive application
of least squares may provide far less than satisfactory results. In this section, it is
shown that, within the Bayesian framework, the search for an optimal approximate 
model can be given a rigorous formulation that admits a projection-based solution. 

We assume that the data Y are distributed according to (4.9), which summarizes 
our state of knowledge. However, rather than resorting to Bayesian estimation of 
the vector $\theta$, an approximate model, typically of low order, is searched for. For
instance, if $\theta_i$ were the samples of an impulse response, one might be interested in
approximating them by a parametric model:

$$\theta \approx g(\zeta), \qquad g(\zeta) = [\, g_1(\zeta) \ \cdots \ g_n(\zeta) \,]^T,$$

where $\zeta = [\, \zeta_1 \ \cdots \ \zeta_q \,]^T$ is the unknown parameter vector. For example, in order to
approximate the sequence $\theta_i$ by means of a single exponential function, it suffices to
let $q = 2$ and

$$g_i(\zeta) = \zeta_1 e^{\zeta_2 i},$$

where $\zeta_1$ is the amplitude and $\zeta_2$ is the rate coefficient of the exponential.
A very natural estimator is the least squares one:

$$\zeta^{LS} = \arg\min_{\zeta} \| Y - \Phi g(\zeta) \|^2.$$

Note that $\zeta^{LS}$ coincides with the maximum likelihood estimate if the following model
is assumed:

$$Y = \Phi g(\zeta) + E, \qquad E \sim \mathcal{N}(0, \sigma^2 I_N).$$

In the present context, however, no claim is made that reality conforms to our approximate
model. It may well be that the true $\theta$, being more complex than its parsimonious
parametric model $g(\zeta)$, is better represented by the model (4.9). Nevertheless, we are
interested in finding the best approximation of $\theta$ within a set $\mathcal{G} = \{ g(\zeta) \mid \zeta \in \mathbb{R}^q \}$
of parametric approximations.

Under model (4.9), the optimal approximate model $g^*$ can be defined as the one
that minimizes the mean squared error $\mathcal{E}\|\theta - g\|^2$. For a generic model $g = g(\zeta)$,
parametrized by the vector $\zeta \in \mathbb{R}^q$, $q < n$, we have that

$$g^* = g(\zeta^*), \qquad \zeta^* := \arg\min_{\zeta} \mathcal{E}\left[ \|\theta - g(\zeta)\|^2 \mid Y \right], \qquad (4.41)$$

where the conditional expectation is taken with reference to the probability measure
specified by (4.9). The following theorem, whose proof is in Sect. 4.13.5, was first
derived in the context of linear system identification [20]. It shows that the optimal
approximation is the projection of the Bayes estimate $\theta^B$ onto the set $\mathcal{G}$.


Theorem 4.6 (Optimal approximation, based on [20]) Assume that (4.9) holds.
Then,

$$\zeta^* = \arg\min_{\zeta} \|\theta^B - g(\zeta)\|^2. \qquad (4.42)$$


In view of the last theorem, the best approximation $g(\zeta) \in \mathcal{G}$ can be computed by
a two-step procedure. First, the Bayes estimate $\theta^B$ is obtained and in the second step
the optimal $g(\zeta^*)$ is calculated as the solution of the least squares problem (4.42).

An interesting question is whether the obtained approximation is still optimal if
the goal is minimizing the error, not with respect to $\theta$, but with respect to the noiseless
output $\Phi\theta$. In other words, the goal is finding $g^o$ that minimizes $\|\Phi\theta - \Phi g^o\|^2$. This
can be done by introducing a weighted norm in the cost function:

$$g^o = g(\zeta^o), \qquad \zeta^o = \arg\min_{\zeta} \mathcal{E}\left[ \|\theta - g(\zeta)\|_W^2 \mid Y \right], \qquad (4.43)$$

where $\|x\|_W^2$ stands for $x^T W x$. In particular, if $W = \Phi^T \Phi$, then

$$\|\theta - g(\zeta)\|_W^2 = \|\Phi\theta - \Phi g(\zeta)\|^2.$$

By extending the proof of Theorem 4.6 to the case of a weighted norm, the following
projection result is obtained.


Theorem 4.7 (Optimal weighted approximation, based on [20]) Assume that (4.9)
holds. Then,

$$\zeta^o = \arg\min_{\zeta} \|\theta^B - g(\zeta)\|_W^2. \qquad (4.44)$$


The consequence is that different approximations $g^o$ are obtained depending on
their prospective use. If the scope is just approximating $\theta$, then $W = I_n$, but, if the
scope is predicting the outputs, then $W = \Phi^T \Phi$ and a different result is obtained.
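
The two-step procedure of Theorems 4.6 and 4.7 can be sketched numerically as follows. This is only an illustration under stated assumptions: the prior covariance is taken as the identity, the approximating family is a single exponential $\zeta_1 e^{\zeta_2 i}$, and the projection is carried out by a grid search over the rate $\zeta_2$ with the amplitude $\zeta_1$ computed in closed form. The function names `bayes_estimate` and `project_exponential`, the simulated data and the grid are all hypothetical choices made for the example.

```python
import numpy as np

def bayes_estimate(Phi, Y, Sigma_theta, sigma2):
    """Posterior mean of theta for Y = Phi @ theta + E with a zero-mean Gaussian prior."""
    N = Phi.shape[0]
    Sigma_Y = Phi @ Sigma_theta @ Phi.T + sigma2 * np.eye(N)
    return Sigma_theta @ Phi.T @ np.linalg.solve(Sigma_Y, Y)

def project_exponential(theta_b, W, rates):
    """Weighted projection of theta_b onto {zeta1 * exp(zeta2 * i), i = 1..n}.

    For each candidate rate zeta2 the best amplitude zeta1 is a scalar weighted
    least squares solution, so a one-dimensional grid search is sufficient.
    Returns the tuple (cost, zeta1, zeta2) of the best grid point.
    """
    i = np.arange(1, theta_b.size + 1)
    best = None
    for z2 in rates:
        b = np.exp(z2 * i)                       # exponential profile for this rate
        z1 = (b @ W @ theta_b) / (b @ W @ b)     # optimal amplitude given the rate
        r = theta_b - z1 * b
        cost = r @ W @ r                         # squared W-weighted distance
        if best is None or cost < best[0]:
            best = (cost, z1, z2)
    return best

# Simulated data: the "true" theta is a sum of two exponentials, hence it does
# not belong to the single-exponential family used for the approximation.
rng = np.random.default_rng(0)
n, N, sigma2 = 30, 60, 0.01
idx = np.arange(1, n + 1)
theta_true = 0.8 * np.exp(-0.2 * idx) + 0.4 * np.exp(-0.6 * idx)
Phi = rng.standard_normal((N, n))
Y = Phi @ theta_true + np.sqrt(sigma2) * rng.standard_normal(N)

# Step 1: Bayes estimate (identity prior covariance is an assumption of this sketch).
theta_b = bayes_estimate(Phi, Y, np.eye(n), sigma2)

# Step 2: projection, with W = I (approximating theta) or W = Phi^T Phi (outputs).
rates = np.linspace(-2.0, 0.0, 401)
fit_theta = project_exponential(theta_b, np.eye(n), rates)
fit_output = project_exponential(theta_b, Phi.T @ Phi, rates)
print("W = I:         zeta1 = %.3f, zeta2 = %.3f" % fit_theta[1:])
print("W = Phi^T Phi: zeta1 = %.3f, zeta2 = %.3f" % fit_output[1:])
```

The two fits generally differ, in agreement with the remark that the approximation depends on its prospective use. A gradient-based nonlinear least squares solver could replace the grid search when the parametric family has more than one nonlinear parameter.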

4.8 Equivalent Degrees of Freedom


In this section, the Bayesian estimation problem for the linear model is analysed
by means of a diagonalization approach. The purpose is twofold: (i) the equivalent
degrees of freedom of the Bayesian estimator are introduced together with their
relationship with suitable weighted squared sums of residuals and squared sums of
estimated parameters; (ii) it is shown that $\eta^{ML}$, the ML estimate of the hyperparameter
vector, satisfies meaningful conditions involving the degrees of freedom. Finally, the
obtained results are applied to the tuning of the regularization parameter, defined
as the ratio between scaling factors for the noise variance $\Sigma_E$ and the parameter
variance $\Sigma_\theta$. For the sake of simplicity, in this section, we assume $\mu_\theta = 0$.

Let us consider the case when the hyperparameters are just two scaling factors
for the covariance matrices $\Sigma_E$ and $\Sigma_\theta$, that is,

$$\Sigma_\theta = \lambda K, \quad \lambda > 0 \qquad (4.45)$$
$$\Sigma_E = \sigma^2 \Psi, \quad \sigma^2 > 0 \qquad (4.46)$$
$$\eta = [\, \lambda \ \ \sigma^2 \,]^T, \qquad (4.47)$$

where $K$ and $\Psi$ are known positive definite matrices. In such a case, it is immediate
to see that the Bayes estimate

$$\theta^B = \left( \Phi^T \Psi^{-1} \Phi + \frac{\sigma^2}{\lambda} K^{-1} \right)^{-1} \Phi^T \Psi^{-1} Y$$

depends only on the ratio $\gamma = \sigma^2/\lambda$, which behaves as a deterministic regularization
parameter. This means that only the ratio between the scaling factors is relevant to
the computation of a point estimate, although both of them are needed to compute the
posterior variance (4.14). When $\Psi = I_N$ and $K = I_n$, the above estimator provides
a Bayesian interpretation of the classical ridge regression estimator. In particular, $\gamma$
can be interpreted as a noise-to-signal ratio and its tuning reformulated as a statistical
estimation problem.

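Given a positive definite symmetric matrix $S$, let $S^{1/2} = (S^{1/2})^T$ be its symmetric
square root, i.e., $S^{1/2} S^{1/2} = S$. Now, consider the singular value decomposition

$$\Psi^{-1/2} \Phi K^{1/2} = U D V^T,$$

where $U$ and $V$ are square matrices such that $U^T U = I_N$ and $V^T V = I_n$, and
$D \in \mathbb{R}^{N \times n}$ is a diagonal matrix with diagonal entries $\{d_i\}$, $i = 1, \ldots, n$, see (3.134).
Moreover, define

$$\bar{Y} = U^T \Psi^{-1/2} Y, \qquad \bar{\theta} = V^T K^{-1/2} \theta, \qquad \bar{E} = U^T \Psi^{-1/2} E.$$

Observe that

$$\mathcal{E}(\bar{E} \bar{E}^T) = U^T \Psi^{-1/2} \, \mathcal{E}(E E^T) \, \Psi^{-1/2} U = \sigma^2 U^T U = \sigma^2 I_N.$$

Analogously, $\mathcal{E}(\bar{\theta} \bar{\theta}^T) = \lambda I_n$. Moreover,

$$\bar{Y} = U^T \Psi^{-1/2} (\Phi \theta + E) = U^T \Psi^{-1/2} \Phi K^{1/2} V V^T K^{-1/2} \theta + \bar{E} = U^T U D V^T V \bar{\theta} + \bar{E} = D \bar{\theta} + \bar{E}.$$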


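In view of these properties, it follows that the original Bayesian estimation problem
admits the following diagonal reformulation:

$$\bar{Y} = D \bar{\theta} + \bar{E}, \qquad \bar{E} \sim \mathcal{N}(0, \sigma^2 I_N), \qquad \bar{\theta} \sim \mathcal{N}(0, \lambda I_n), \qquad (4.48)$$

where $\bar{E}$ and $\bar{\theta}$ are independent of each other.
In view of statistical independence, we have $N$ independent scalar models:

$$\bar{y}_i = d_i \bar{\theta}_i + \bar{e}_i, \quad i = 1, \ldots, n, \qquad \bar{y}_i = \bar{e}_i, \quad i = n+1, \ldots, N,$$

where $\bar{e}_i \sim \mathcal{N}(0, \sigma^2)$, $i = 1, \ldots, N$, and $\bar{\theta}_i \sim \mathcal{N}(0, \lambda)$, $i = 1, \ldots, n$.
By (4.11), it is straightforward to see that the Bayes estimates are

$$\bar{\theta}_i^B = \frac{\lambda d_i \bar{y}_i}{\sigma^2 + \lambda d_i^2} = \frac{d_i \bar{y}_i}{\gamma + d_i^2}, \qquad i = 1, \ldots, n,$$

or, in matrix form,

$$\bar{\theta}^B = (D^T D + \gamma I_n)^{-1} D^T \bar{Y}.$$

Let the residuals be defined as $\bar{\varepsilon}_i = \bar{y}_i - \bar{d}_i \bar{\theta}_i^B$, $i = 1, \ldots, N$, where

$$\bar{d}_i = \begin{cases} d_i, & 1 \le i \le n \\ 0, & n < i \le N. \end{cases} \qquad (4.49)$$

Then, we have

$$\bar{\varepsilon}_i = \bar{y}_i - \frac{\bar{d}_i^2 \bar{y}_i}{\gamma + \bar{d}_i^2} = \frac{\gamma \bar{y}_i}{\gamma + \bar{d}_i^2} \qquad (4.50)$$

$$\mathcal{E} \bar{\varepsilon}_i^2 = \frac{\gamma^2 \, \mathcal{E} \bar{y}_i^2}{(\gamma + \bar{d}_i^2)^2} = \frac{\gamma^2 (\lambda \bar{d}_i^2 + \sigma^2)}{(\gamma + \bar{d}_i^2)^2} = \frac{\sigma^2 \gamma}{\gamma + \bar{d}_i^2} = \sigma^2 \left( 1 - \frac{\bar{d}_i^2}{\gamma + \bar{d}_i^2} \right) \qquad (4.51)$$

or, in matrix form,

$$\bar{\varepsilon} = \gamma (D D^T + \gamma I_N)^{-1} \bar{Y}, \qquad \mathcal{E} \|\bar{\varepsilon}\|^2 = \sigma^2 \left( N - \mathrm{trace}\left( D (D^T D + \gamma I_n)^{-1} D^T \right) \right). \qquad (4.52)$$

It is worth noting that the above relationships do not hold for a generic regularization
parameter $\gamma$, but only when $\gamma = \sigma^2/\lambda$. In the remaining part, we present some
results that were first derived in the context of Bayesian deconvolution in [5]. The
proof of the following proposition is in Sect. 4.13.6.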


Proposition 4.3 (based on [5]) For a given hyperparameter vector $\eta$, let WRSS
denote the following weighted squared sum of residuals:

$$\mathrm{WRSS} = (Y - \Phi \theta^B)^T \Psi^{-1} (Y - \Phi \theta^B),$$

where $\theta^B = \mathcal{E}[\theta | Y, \eta]$. Then,

$$\mathcal{E}(\mathrm{WRSS}) = \sigma^2 \left( N - \mathrm{trace}(H(\gamma)) \right),$$

where

$$H(\gamma) = \Phi \left( \Phi^T \Psi^{-1} \Phi + \gamma K^{-1} \right)^{-1} \Phi^T \Psi^{-1}$$

is the so-called hat matrix.


As already noted, see (3.64), when $\Sigma_E = \sigma^2 I_N$, the predicted output $\hat{Y} = \Phi \theta^B$
and the measured output $Y$ are related through the hat matrix:

$$\hat{Y} = H(\gamma) Y.$$

In order to better understand the link between the hat matrix and the degrees of
freedom, just consider the standard linear model $Y = \Phi \theta + E$, $\theta \in \mathbb{R}^n$, and the
corresponding LS estimate $\theta^{LS} = (\Phi^T \Phi)^{-1} \Phi^T Y$. The predicted output is $\hat{Y} = H^{LS} Y$,
where $H^{LS} = \Phi (\Phi^T \Phi)^{-1} \Phi^T$ enjoys the property $\mathrm{trace}(H^{LS}) = n$.

It is this analogy that justifies the introduction of equivalent degrees of freedom
which we already encountered in (3.65) as a function of the regularized estimate $\theta^R$
described in the deterministic context. Its definition, here derived starting from the
stochastic context, is reported below stressing its dependence on the regularization
parameter $\gamma$.


Definition 4.2 (equivalent degrees of freedom) The quantity

$$\mathrm{dof}(\gamma) = \mathrm{trace}(H(\gamma)), \qquad 0 \le \mathrm{dof}(\gamma) \le n \qquad (4.53)$$

is called equivalent degrees of freedom.


In view of (4.52),

$$\mathrm{dof}(\gamma) = \sum_{i=1}^{n} \frac{d_i^2}{d_i^2 + \gamma},$$

so that $\mathrm{dof}(\gamma)$ is a monotonically decreasing function of $\gamma$ with $0 \le \mathrm{dof}(\gamma) \le n$.
The equivalent degrees of freedom provide an easily understandable measure of the
flexibility of the estimator: for instance, if they are approximately equal to three, the
Bayesian estimator has a flexibility comparable to a model with three parameters.
For linear-in-parameter models estimated by ordinary or weighted least squares, the
degrees of freedom coincide with the rank of the regressor matrix and, therefore,
they can take only integer values. The equivalent degrees of freedom of the Bayesian
estimator, conversely, are a nonnegative real number controlled by $\gamma$.

The next theorem establishes a connection between the degrees of freedom and
the ML estimate

$$\eta^{ML} = [\, \lambda^{ML} \ \ (\sigma^2)^{ML} \,]^T$$

of the hyperparameter vector. Accordingly, we define

$$\gamma^{ML} = \frac{(\sigma^2)^{ML}}{\lambda^{ML}}.$$

Moreover, we introduce the following weighted squared sum of estimated parameters:

$$\mathrm{WPSS} = (\theta^B)^T K^{-1} \theta^B = \|\bar{\theta}^B\|^2 = \sum_{i=1}^{n} \frac{d_i^2 \bar{y}_i^2}{(d_i^2 + \gamma)^2}. \qquad (4.54)$$

The proof of the following result is in Sect. 4.13.7.


Theorem 4.8 (based on [5]) Assume that model (4.9) holds where $\Sigma_\theta$ and $\Sigma_E$
are as in (4.45)-(4.47). Then, the ML estimates of the hyperparameters satisfy the
following necessary conditions:

$$\mathrm{WRSS} = (\sigma^2)^{ML} \left( N - \mathrm{dof}(\gamma^{ML}) \right) \qquad (4.55)$$
$$\mathrm{WPSS} = \lambda^{ML} \, \mathrm{dof}(\gamma^{ML}). \qquad (4.56)$$


By taking the ratio between (4.55) and (4.56), the following proposition is
obtained.


Proposition 4.4 (based on [5]) If $\lambda^{ML}$ and $(\sigma^2)^{ML}$ are nonnull and finite, then

$$\gamma^{ML} = \frac{\mathrm{dof}(\gamma^{ML})}{N - \mathrm{dof}(\gamma^{ML})} \, \frac{\mathrm{WRSS}}{\mathrm{WPSS}}. \qquad (4.57)$$


This last result can be used as a simple and practical tuning procedure as it requires
just a line search on the scalar $\gamma$. Of course, (4.57) relies on the necessary conditions
of Theorem 4.8, so that one has to check if the solution corresponds to a maximum
of the likelihood function.
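
As a rough illustration of this tuning procedure, the following sketch evaluates the equivalent degrees of freedom, WRSS and WPSS for a given ratio, and scans a logarithmic grid for sign changes of the difference between the two sides of (4.57). Everything specific here is an assumption made for the example: the simulated data, the diagonal matrix `K`, the grid, and the hypothetical function names `dof_wrss_wpss` and `tuning_residual`.

```python
import numpy as np

def dof_wrss_wpss(gamma, Phi, Y, Psi, K):
    """Return dof(gamma), WRSS and WPSS for the Bayes estimate with ratio gamma."""
    Psi_inv = np.linalg.inv(Psi)
    K_inv = np.linalg.inv(K)
    A = Phi.T @ Psi_inv @ Phi + gamma * K_inv
    G = np.linalg.solve(A, Phi.T @ Psi_inv)    # maps Y to the Bayes estimate
    theta_b = G @ Y
    H = Phi @ G                                 # hat matrix H(gamma)
    r = Y - Phi @ theta_b                       # output residuals
    return np.trace(H), r @ Psi_inv @ r, theta_b @ K_inv @ theta_b

def tuning_residual(gamma, Phi, Y, Psi, K):
    """Left-hand side minus right-hand side of the necessary condition (4.57)."""
    N = Y.size
    dof, wrss, wpss = dof_wrss_wpss(gamma, Phi, Y, Psi, K)
    return gamma - dof / (N - dof) * wrss / wpss

# Simulated data drawn from the assumed model (all values are illustrative).
rng = np.random.default_rng(0)
N, n = 80, 20
Phi = rng.standard_normal((N, n))
K = np.diag(0.8 ** np.arange(n))
Psi = np.eye(N)
lam_true, sigma2_true = 2.0, 0.5
theta = np.sqrt(lam_true * np.diag(K)) * rng.standard_normal(n)
Y = Phi @ theta + np.sqrt(sigma2_true) * rng.standard_normal(N)

# Line search: look for sign changes of the residual on a logarithmic grid.
gammas = np.logspace(-4, 4, 400)
res = np.array([tuning_residual(g, Phi, Y, Psi, K) for g in gammas])
crossing = np.where(np.sign(res[:-1]) != np.sign(res[1:]))[0]
candidates = np.sqrt(gammas[crossing] * gammas[crossing + 1])
print("ratio used in the simulation:", sigma2_true / lam_true)
print("candidate values of gamma   :", candidates)
```

Each candidate only satisfies a necessary condition, so the marginal likelihood should still be evaluated at the candidates to confirm that a maximum has been found.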

4.9 Bayesian Function Reconstruction


In this section, the Bayesian estimation approach is illustrated through its application
to the reconstruction of an unknown function from noisy samples. The observations
will be generated by adding pseudorandom noise to a known function $g(x)$, so that
the performance of alternative estimators can be directly assessed by comparison
with the ground truth. The selected $g(x)$ is the same function (3.26) used in the
previous chapter in order to illustrate polynomial regression:

$$g(x) = (\sin(x))^2 (1 - x^2), \qquad x \in [0, 1]. \qquad (4.58)$$

Also the noise model is the same:

$$y_i = g(x_i) + e_i, \qquad i = 1, \ldots, N. \qquad (4.59)$$

We let $N = 40$, $x_1 = 0$, $x_{40} = 1$, and $x_2, \ldots, x_{39}$ are evenly spaced points between
$x_1$ and $x_{40}$. Finally, $e_i$, $i = 1, \ldots, 40$, are i.i.d. Gaussian distributed with mean zero
and standard deviation 0.034.
The problem of estimating $\theta_i = g(x_i)$, i.e., the samples of the unknown function,
is a particular case of the linear Gaussian model (4.9) with $\Phi = I_N$, that is,

$$Y = \theta + E, \qquad E \sim \mathcal{N}(0, \sigma^2 I_N). \qquad (4.60)$$

Since $\Phi$ is square, in this case, the number $n$ of unknowns coincides with the number
$N$ of observations.

The noisy data and the true function are displayed in the top left panel of Fig. 4.1.
It is assumed that the available prior knowledge regards the “regularity” of $g(\cdot)$ and
the knowledge that $g(0) = 0$. A possible probabilistic translation of this qualitative
knowledge is assuming that $\theta_i$ is a so-called random walk:

$$\theta_i = \theta_{i-1} + w_i, \qquad i = 1, \ldots, N, \qquad \theta_0 = 0,$$

where $w_i \sim \mathcal{N}(0, \lambda)$ are independent random variables. In fact, under the random
walk model, the first difference

$$\theta_i - \theta_{i-1} = w_i$$

has a finite variance, equal to $\lambda$. Hence, if we approximate the derivative of $g(\cdot)$ by
the first difference $\theta_i - \theta_{i-1}$, this approximation is less than $1.96\sqrt{\lambda}$ in absolute
value with probability 0.95, which guarantees that the profile of the function cannot vary
too quickly. Note that, due to the qualitative nature of the prior knowledge, the precise
value of $\lambda$ is unknown, so that it has to be treated as a hyperparameter. Conversely, it is
assumed that the true value of $\sigma^2$ is known. Summarizing, we have

$$\theta_i = \sum_{j=1}^{i} w_j, \qquad i = 1, \ldots, N,$$

or, in matrix form,

$$\theta = F w, \qquad F = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}, \qquad w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix}.$$

Fig. 4.1 Function reconstruction example. Top left: noisy data and true function. Top right, bottom
left and bottom right: Residual sum of squares, i.e., the sum of the squared differences between
the function values and their estimates, degrees of freedom and marginal loglikelihood against
the hyperparameter $\lambda$. The oracle denotes the value that minimizes RSS while ML indicates the
maximizer of the marginal likelihood

Observing that $\mathrm{Var}(w) = \lambda I_N$, the prior variance of the parameter vector is

$$\Sigma_\theta = \lambda F F^T = \lambda \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 2 & \cdots & 2 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 2 & \cdots & N \end{bmatrix}.$$

For a given $\lambda$, the Bayes estimate $\theta^B$ is obtained according to (4.10) and can be
written as

$$\theta^B = \Sigma_\theta (\Sigma_\theta + \sigma^2 I_N)^{-1} Y.$$

The corresponding equivalent degrees of freedom, obtained by (4.53), are now
regarded as a (monotonically nondecreasing) function of $\lambda$, i.e.,

$$\mathrm{dof}(\lambda) = \mathrm{trace}\, H(\lambda), \qquad H(\lambda) = \Sigma_\theta (\Sigma_\theta + \sigma^2 I_N)^{-1}, \qquad \Sigma_\theta = \lambda F F^T.$$

In the bottom left panel of Fig. 4.1, the degrees of freedom are plotted against $\lambda$.
For small values of $\lambda$ they are close to zero and get closer to $N = 40$ as $\lambda$ goes to
infinity. It is a rather general feature that the function $\mathrm{dof}(\lambda)$ is better visualized on
a semilog scale. In order to tune the regularization parameter $\lambda$, one can resort to the
maximization of the marginal loglikelihood:

$$\lambda^{ML} = \arg\max_{\lambda} \left\{ -\frac{1}{2} \log\det(2\pi \Sigma_Y) - \frac{1}{2} Y^T \Sigma_Y^{-1} Y \right\}, \qquad \Sigma_Y = \Sigma_\theta + \sigma^2 I_N = \lambda F F^T + \sigma^2 I_N.$$

It turns out that $\lambda^{ML} = 4.92\mathrm{e}{-4}$, the corresponding degrees of freedom being 12.17.
For the sake of comparison, $\lambda = 6.61\mathrm{e}{-4}$ is the best possible value, i.e., the one
provided by an oracle that exploits the knowledge of the true function in order
to minimize the sum of the squared reconstruction errors. This latter quantity is
a function of $\lambda$ and here denoted by $\mathrm{RSS}(\lambda)$. As seen in the top right panel of Fig. 4.1,
marginal likelihood maximization achieves $\mathrm{RSS} = 9.80\mathrm{e}{-2}$, not much worse than
$\mathrm{RSS} = 9.71\mathrm{e}{-2}$ achieved by the oracle, whose associated degrees of freedom are
13.88. Therefore, in this specific case, the marginal likelihood criterion somehow
underestimates the complexity of the model.
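
A minimal numerical sketch of this experiment is given below. It is not expected to reproduce the figures reported in the text, since the pseudorandom noise realization is different. The test function is written as $(\sin x)^2 (1 - x^2)$, which is how (4.58) is read here, and the grid of $\lambda$ values, the seed and the helper name `fit` are assumptions of the example.

```python
import numpy as np

def g(x):
    # Test function, assumed here to be (sin x)^2 (1 - x^2) as in (4.58).
    return np.sin(x) ** 2 * (1.0 - x ** 2)

N, sigma = 40, 0.034
x = np.linspace(0.0, 1.0, N)
rng = np.random.default_rng(0)
Y = g(x) + sigma * rng.standard_normal(N)

F = np.tril(np.ones((N, N)))    # theta = F w (random walk)
FFt = F @ F.T                   # entry (i, j) equals min(i, j)

def fit(lam):
    """Bayes estimate, dof, marginal loglikelihood and posterior covariance."""
    Sigma_theta = lam * FFt
    Sigma_Y = Sigma_theta + sigma ** 2 * np.eye(N)
    H = Sigma_theta @ np.linalg.inv(Sigma_Y)          # hat matrix H(lambda)
    theta_b = H @ Y
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma_Y)
    loglik = -0.5 * logdet - 0.5 * Y @ np.linalg.solve(Sigma_Y, Y)
    post_cov = Sigma_theta - H @ Sigma_theta          # Var(theta | Y)
    return theta_b, np.trace(H), loglik, post_cov

lams = np.logspace(-7, 0, 141)
results = [fit(lam) for lam in lams]
dofs = np.array([r[1] for r in results])
logliks = np.array([r[2] for r in results])
rss = np.array([np.sum((r[0] - g(x)) ** 2) for r in results])

lam_ml = lams[np.argmax(logliks)]        # marginal likelihood maximizer on the grid
lam_oracle = lams[np.argmin(rss)]        # oracle: needs the true function
theta_ml, dof_ml, _, post_cov = fit(lam_ml)
half_width = 1.96 * np.sqrt(np.maximum(np.diag(post_cov), 0.0))  # 95% credible band

print("lambda_ML = %.3g, dof = %.2f" % (lam_ml, dof_ml))
print("lambda_oracle = %.3g" % lam_oracle)
```

The grid search stands in for a proper one-dimensional optimizer; refining the grid around the maximizer, or optimizing over $\log \lambda$, would give a more accurate value of $\lambda^{ML}$.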

In Fig. 4.2, the estimates obtained in correspondence of six different values of
$\lambda$ are displayed. It is apparent that for $\lambda = 1\mathrm{e}{-6}$ and $\lambda = 1\mathrm{e}{-5}$ the estimated
function is overregularized, while overfitting occurs for $\lambda = 1\mathrm{e}{-1}$ and $\lambda = 1\mathrm{e}{-2}$.
The two bottom panels display the oracle and ML estimates, the former exhibiting a
slightly more regular profile.

Finally, observing that in our case $\Sigma_{\theta Y} = \Sigma_\theta$, we have

$$\Sigma_{\theta|Y} = \mathrm{Var}(\theta|Y) = \Sigma_\theta - \Sigma_\theta \Sigma_Y^{-1} \Sigma_\theta$$

and we can compute the 95% Bayesian credible intervals, according to (4.8). As it
can be seen from Fig. 4.3, the credible limits successfully capture the uncertainty, as
demonstrated by the fact that the true function lies within the limits.

Fig. 4.2 Function reconstruction example. The panels display the Bayes estimates $\hat{g}(x)$ corresponding
to six different values of the hyperparameter $\lambda$, including the one provided by the oracle
and the maximum likelihood one

Fig. 4.3 Function reconstruction example. True function and Empirical Bayes estimate $\hat{g}(x)$ based
on $\lambda^{ML}$ together with its 95% Bayesian credible intervals

This simple example has shown that Bayesian estimation can be effectively 
employed in order to reconstruct an unknown function without need of assuming 
a specific parametric structure, e.g., polynomial or other. The key idea is the use 
of a smoothness prior, expressed through the assumed prior distribution of the first 
differences of the function. The associated variance A is treated as a hyperparameter 
that can be tuned via marginal likelihood maximization. Altogether, this is a flexi- 
ble Empirical Bayes scheme that can be employed as a general-purpose black-box 
estimator. 

Of interest is also the fact that the considered function could have been the impulse 
response of a dynamical system. In this respect, the example highlights also the limits 
of the approach. A first issue, easily fixable, has to do with the insufficient smoothness 
of the estimate. As seen in Fig. 4.3, the true function is significantly smoother than 
its estimate. As a matter of fact, it is not difficult to increase the regularity of the 
Bayes estimate: for instance, it suffices to assume that the samples $\theta_i = g(x_i)$ are an
integrated random walk:

$$\theta_i = \theta_{i-1} + \xi_i$$
$$\xi_i = \xi_{i-1} + w_i,$$

where $w_i \sim \mathcal{N}(0, \lambda)$ are again independent and identically distributed. This prior
distribution is going to yield smoother profiles. Rather interestingly, the obtained
estimate can be seen as the discrete-time counterpart of cubic smoothing splines, a
method widely used for the nonparametric reconstruction of unknown functions.
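
The integrated random walk prior can be built from the same summation matrix used for the random walk. The sketch below assumes zero initial conditions for both recursions (an assumption not stated in the text) and uses hypothetical helper names `prior_covariance` and `bayes_smoother`; the values of $\lambda$ are arbitrary and serve only to compare the roughness of the two estimates.

```python
import numpy as np

N = 40
F = np.tril(np.ones((N, N)))    # summation operator: (F w)_i = w_1 + ... + w_i
F2 = F @ F                      # integrated random walk: theta = F (F w)

def prior_covariance(lam, integrated=False):
    """Prior covariance of theta for the (integrated) random walk model."""
    A = F2 if integrated else F
    return lam * A @ A.T

def bayes_smoother(Y, Sigma_theta, sigma2):
    """Bayes estimate for Y = theta + E, theta ~ N(0, Sigma_theta)."""
    return Sigma_theta @ np.linalg.solve(Sigma_theta + sigma2 * np.eye(Y.size), Y)

# The second differences of an integrated random walk return the increments w.
rng = np.random.default_rng(0)
w = rng.standard_normal(N)
theta_irw = F2 @ w
second_diff = np.diff(theta_irw, n=2)     # equals w[2:]

# Compare the roughness of the two estimates on the same noisy data.
x = np.linspace(0.0, 1.0, N)
sigma = 0.034
Y = np.sin(x) ** 2 * (1.0 - x ** 2) + sigma * rng.standard_normal(N)
est_rw = bayes_smoother(Y, prior_covariance(5e-4), sigma ** 2)
est_irw = bayes_smoother(Y, prior_covariance(1e-6, integrated=True), sigma ** 2)
rough_rw = np.sum(np.diff(est_rw, n=2) ** 2)
rough_irw = np.sum(np.diff(est_irw, n=2) ** 2)
print("sum of squared second differences (random walk):           ", rough_rw)
print("sum of squared second differences (integrated random walk):", rough_irw)
```

In practice the hyperparameter of each prior would be tuned separately, for instance by marginal likelihood maximization, before the two estimates are compared.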

A more serious issue regards extrapolation properties of the estimate that are in 
turn connected with the type of asymptotic decay shown by stable impulse responses. 
As it can be seen from Fig. 4.3, oscillations and credible intervals do not tend to 
dampen as x increases. While it would be easy to compute the Bayes estimate also for 
values far beyond the observation window, the result would be disappointing. Indeed, 
coherently with the diffusive nature of random walks, the width of the credible band 
would diverge, which is unnecessarily conservative when a stable impulse response 
is reconstructed. It appears that the task of identifying impulse responses calls for 
prior distributions that are specifically suited to their features, especially the
asymptotic ones. The development of these prior distributions, or equivalently the 
design of suitable regularization penalties, will be a central topic of the subsequent 
chapters. 


4.10 Markov Chain Monte Carlo Estimation 


As already mentioned in Sect. 4.4, in the full Bayesian approach the computation of the posterior

$$p(\theta|Y) = \int p(\theta, \eta|Y) \, d\eta = \int p(\theta|\eta, Y) \, p(\eta|Y) \, d\eta$$

requires a marginalization with respect to the hyperparameter vector $\eta$. In general, this
integral cannot be computed analytically. Nevertheless it can be computed numerically
by means of Markov Chain Monte Carlo (MCMC) methods that generate
pseudorandom samples drawn from the joint posterior density $p(\theta, \eta|Y)$. The Gibbs
sampling (GS) algorithm is the most straightforward and popular MCMC method.
Its goal is to simulate a realization of a Markov chain, whose samples, though not
independent of each other, form an ergodic process whose stationary distribution
coincides with the desired posterior. Hence, provided that the burn-in phase is discarded,
the posterior distribution is approximated by the histogram of the samples.
In order to generate the samples, at each step a random extraction is made from a
proposal distribution. In the Gibbs sampler, the proposal distribution is the so-called
full conditional, that is, the distribution of a given element of the parameter vector
given the data and the current values of all other elements.

For the linear Gaussian model (4.9), a Gibbs sampler may be implemented as
follows:

1. Select initializations $\eta^{(0)}$, $\theta^{(0)}$, and let $k = 0$.
2. Draw a sample $\eta^{(k+1)}$ from the full conditional distribution $p(\eta|\theta^{(k)}, Y)$.
3. Draw a sample $\theta^{(k+1)}$ from the full conditional distribution $p(\theta|\eta^{(k+1)}, Y)$.
4. If $k = k_{\max}$, end, else $k = k + 1$ and go to step 2.

This stochastic simulation algorithm generates a Markov chain whose stationary
distribution coincides with $p(\theta, \eta|Y)$. Therefore, though correlated, the generated
samples $\{\theta^{(k)}, \eta^{(k)}\}$ can be used to estimate the (joint and marginal) posterior
distributions and also the posterior expectations via the proper sample averages. For
example,

$$\frac{1}{k_{\max}} \sum_{k=1}^{k_{\max}} \theta^{(k)} \approx \mathcal{E}(\theta|Y).$$

The choice of the prior distributions $p(\theta|\eta)$ and $p(\eta)$ has a critical influence on the
efficiency of the scheme. The priors are called conjugate when, for each parameter,
the prior and the full conditional belong to the same distribution family. This implies
that the same random variable generators can be used throughout the simulation.

Consider model (4.9), where $\Sigma_E$ is known and $\Sigma_\theta = \lambda K$, with $\lambda$ unknown. Below,
we describe a Gibbs sampling scheme for obtaining the posterior distributions of $\theta$
and $\eta = \lambda$. For $\theta$, the prior is $\theta|\lambda \sim \mathcal{N}(0, \lambda K)$. A conjugate prior for $\lambda$ is the inverse
Gamma distribution:

$$\frac{1}{\lambda} \sim \Gamma(g_1, g_2), \qquad g_1, g_2 > 0.$$

In other words, it is assumed that $1/\lambda$ is distributed as a Gamma random variable, so
that

$$p\left( \frac{1}{\lambda} \right) \propto \left( \frac{1}{\lambda} \right)^{g_1 - 1} e^{-g_2/\lambda}.$$

With this choice of the prior, the full conditional of $1/\lambda$ will be distributed as a
suitable Gamma variable, for every $k$. More precisely, it can be shown that, if

$$p(x|\lambda) \sim \mathcal{N}(0, \lambda I_n), \qquad p\left( \frac{1}{\lambda} \right) \sim \Gamma(g_1, g_2),$$

then

$$p\left( \frac{1}{\lambda} \,\Big|\, x \right) \sim \Gamma\left( g_1 + \frac{n}{2}, \; g_2 + \frac{\|x\|^2}{2} \right). \qquad (4.61)$$

Recall that the mean and variance of the Gamma random variable are $g_1/g_2$ and
$g_1/g_2^2$, respectively. For the prior to be as uninformative as possible, we let $g_1$ and $g_2$
decrease to zero. Under these assumptions, the Gibbs sampler unfolds as follows:

1. Initialize $\lambda$ and $\theta$, e.g., using the empirical Bayes estimates

$$\lambda^{(0)} = \lambda^{ML}, \qquad \theta^{(0)} = \mathcal{E}(\theta | \lambda^{ML}, Y)$$

and let $k = 0$.
2. Draw a sample $1/\lambda^{(k+1)}$ from the full conditional distribution

$$p\left( \frac{1}{\lambda} \,\Big|\, \theta^{(k)}, Y \right) = p\left( \frac{1}{\lambda} \,\Big|\, \theta^{(k)} \right) = \Gamma\left( \frac{n}{2}, \; \frac{(\theta^{(k)})^T K^{-1} \theta^{(k)}}{2} \right). \qquad (4.62)$$

3. Draw a sample $\theta^{(k+1)}$ from the full conditional distribution

$$p(\theta|\lambda^{(k+1)}, Y) = \mathcal{N}\left( \mathcal{E}(\theta|\lambda^{(k+1)}, Y), \; \mathrm{Var}(\theta|\lambda^{(k+1)}, Y) \right)$$

whose mean and variance are obtained according to (4.10) or (4.13).
4. If $k = k_{\max}$, end, else $k = k + 1$ and go to step 2.

Above, the expression of the full conditional (4.62) is a direct consequence of the
conjugacy property (4.61), as it can be seen by letting $x = K^{-1/2} \theta^{(k)}$, where $K^{-1/2}$
is a symmetric matrix such that $K^{-1/2} K^{-1/2} = K^{-1}$.
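
The four steps above can be sketched in a few lines. The sketch specializes the known noise covariance to $\sigma^2 I_N$ (an assumption made for brevity), uses the posterior mean at an arbitrary initial value `lam0` instead of the empirical Bayes initialization, and keeps `g1` and `g2` as optional arguments whose default value zero corresponds to the uninformative limit. The function name `gibbs_linear_gaussian` and all numerical settings are hypothetical.

```python
import numpy as np

def gibbs_linear_gaussian(Phi, Y, K, sigma2, n_iter=2000, burn_in=500,
                          lam0=1.0, g1=0.0, g2=0.0, seed=0):
    """Gibbs sampler for Y = Phi theta + E, theta | lam ~ N(0, lam K), E ~ N(0, sigma2 I)."""
    rng = np.random.default_rng(seed)
    n = Phi.shape[1]
    K_inv = np.linalg.inv(K)
    PtP = Phi.T @ Phi / sigma2
    PtY = Phi.T @ Y / sigma2
    # Step 1: initialize theta with the posterior mean for lam = lam0.
    theta = np.linalg.solve(PtP + K_inv / lam0, PtY)
    thetas, lams = [], []
    for k in range(n_iter):
        # Step 2: 1/lam | theta is Gamma with the shape and rate below.
        shape = g1 + 0.5 * n
        rate = g2 + 0.5 * theta @ K_inv @ theta
        lam = 1.0 / rng.gamma(shape, 1.0 / rate)     # NumPy uses scale = 1/rate
        # Step 3: theta | lam, Y is Gaussian with the usual posterior moments.
        cov = np.linalg.inv(PtP + K_inv / lam)
        mean = cov @ PtY
        L = np.linalg.cholesky(0.5 * (cov + cov.T))
        theta = mean + L @ rng.standard_normal(n)
        # Step 4: store the sample once the burn-in phase is over.
        if k >= burn_in:
            thetas.append(theta)
            lams.append(lam)
    return np.array(thetas), np.array(lams)

# Illustrative run on simulated data.
rng = np.random.default_rng(1)
N, n, sigma2 = 100, 10, 0.05
Phi = rng.standard_normal((N, n))
theta_true = rng.standard_normal(n)
Y = Phi @ theta_true + np.sqrt(sigma2) * rng.standard_normal(N)

thetas, lams = gibbs_linear_gaussian(Phi, Y, np.eye(n), sigma2)
theta_mcmc = thetas.mean(axis=0)                  # sample average of the chain
lam_lo, lam_hi = np.percentile(lams, [2.5, 97.5])
print("posterior mean of theta (first 3):", theta_mcmc[:3])
print("95%% interval for lambda: [%.3f, %.3f]" % (lam_lo, lam_hi))
```

The length of the burn-in phase and the number of iterations are arbitrary here; in a real application convergence of the chain should be checked, for instance by inspecting the trace of the sampled $\lambda$.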

When there are other hyperparameters to tune, e.g., the noise variance $\sigma^2$, the
MCMC scheme can be properly extended. Provided that they exist, conjugate priors 
ensure an efficient sampling from the proposal distributions that generate the random 
samples, although a variety of MCMC schemes are available that deal with non- 
conjugate priors at the cost of an increased computational effort. 

The main advantage of MCMC methods is that they implement the full Bayesian 
framework that is only approximated by the empirical Bayes scheme. In particular, 
MCMC methods do not neglect the hyperparameter uncertainty which is correctly 
propagated to the parameter estimate. However, as already discussed in Sect. 4.4, if 
data are informative enough to ensure a precise estimate of the hyperparameters, the 
difference between MCMC and empirical Bayes estimates (and associated credible 
regions) may be of minor importance. 


4.11 Model Selection Using Bayes Factors 


As discussed in Sect. 2.6.2, one fundamental issue is the selection of the “best” model 
inside a class of postulated structures. In the classical setting, this can be performed 
using criteria like AIC (2.34) and BIC (2.36) or adopting a cross-validation strategy. 
We will now see that the Bayesian approach provides a powerful alternative based 
on the concept of posterior model probability. 

Let $\mathcal{M}^i$ be a model structure parametrized by the vector $x^i$. In the system identification
scenario discussed in Chap. 2, the structures could be ARMAX models of
different complexity. Hence, each $x^i$ would correspond to the $\theta^i$ parametrizing (2.1)
and containing the coefficients of rational transfer functions of different orders. If
little knowledge on them were available, poorly informative prior densities could be
assigned. Another example concerns the function estimation problem illustrated in
Sect. 4.9. Here, $x^i$ could contain the samples $\theta^i$ of the unknown function $g$ modelled
as a stochastic process. Then, the different structures could represent different covariances
of $g$ defined by a random walk or an integrated random walk. Each covariance
would then depend on an unknown hyperparameter vector $\eta^i$ containing the variance
of the random walk increments and possibly also of the measurement noise. So, in
this case, one would have $x^i = [\theta^i \ \eta^i]$. Here, $\eta^i$ is a random vector with flat priors
typically assigned to the variances to include just nonnegativity information.

Now, suppose that we are given $m$ competing structures $\mathcal{M}^i$. An important
conceptual step is to interpret them too as (discrete) random variables, each having
probability $\Pr(\mathcal{M}^i)$ before seeing the data $Y$. The selection of the best model has
then a natural answer: one should select the structure having the largest posterior
probability $\Pr(\mathcal{M}^i|Y)$. Using Bayes rule, one has

$$\Pr(\mathcal{M}^i|Y) = \frac{\int p(Y, x^i|\mathcal{M}^i) \, dx^i \; \Pr(\mathcal{M}^i)}{p(Y)}.$$

A typical choice is to think of the structures as equiprobable, so that $\Pr(\mathcal{M}^i) = 1/m$
for any $i$. Then, one can select the $\mathcal{M}^i$ maximizing the so-called Bayesian evidence
given by

$$p(Y|\mathcal{M}^i) = \int p(Y, x^i|\mathcal{M}^i) \, dx^i.$$

Note that this corresponds to the marginal likelihood where all the parameter uncertainty
connected with the $i$-th structure has been integrated out. Given two structures
$\mathcal{M}^1$ and $\mathcal{M}^2$, the Bayes factor is also defined as follows:

$$B_{12} = \frac{p(Y|\mathcal{M}^1)}{p(Y|\mathcal{M}^2)}.$$

Hence, large values of $B_{12}$ indicate that data strongly support $\mathcal{M}^1$ as opposed to $\mathcal{M}^2$.
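
As a simplified illustration, the sketch below compares a random walk structure with an integrated random walk structure on simulated data from the function reconstruction example. It rests on a strong assumption that departs from the general setting described above: the hyperparameters of each structure are held fixed at hypothetical values (`lam1`, `lam2`) rather than integrated out, so that the evidence is available in closed form as a Gaussian marginal likelihood. The function name `log_evidence` is also hypothetical.

```python
import numpy as np

def log_evidence(Y, Sigma_theta, sigma2):
    """log p(Y | structure) for Y = theta + E with theta ~ N(0, Sigma_theta)."""
    N = Y.size
    Sigma_Y = Sigma_theta + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma_Y)
    return -0.5 * logdet - 0.5 * Y @ np.linalg.solve(Sigma_Y, Y)

# Simulated data (test function assumed to be (sin x)^2 (1 - x^2)).
N, sigma = 40, 0.034
x = np.linspace(0.0, 1.0, N)
rng = np.random.default_rng(0)
Y = np.sin(x) ** 2 * (1.0 - x ** 2) + sigma * rng.standard_normal(N)

F = np.tril(np.ones((N, N)))
F2 = F @ F

# Hypothetical fixed hyperparameters for the two structures.
lam1, lam2 = 5e-4, 1e-6
log_e1 = log_evidence(Y, lam1 * F @ F.T, sigma ** 2)      # random walk
log_e2 = log_evidence(Y, lam2 * F2 @ F2.T, sigma ** 2)    # integrated random walk
log_B12 = log_e1 - log_e2
print("log Bayes factor log(B12) = %.2f" % log_B12)
```

Working with the logarithm of the Bayes factor avoids overflow and underflow. A positive value favours the first structure, a negative one the second, always conditional on the fixed hyperparameter values.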

For the computation of the Bayesian evidence, the same numerical considerations
reported at the end of Sect. 4.4 then hold. In particular, when the evidence cannot be
computed explicitly, approximations are needed, such as the Laplace approximation.
The BIC criterion is also often adopted. In particular, in the function estimation
problem one can integrate out $\theta$. Then, one can evaluate the complexity of the model
using the marginal likelihood optimized w.r.t. the hyperparameters $\eta^i$, then adding a
term which penalizes the dimension of the hyperparameter vector. This will be also
discussed later on in Sect. 7.2.1.1.

MCMC can be also used to compute the evidence by simulating from posterior 
distributions and using the harmonic mean of the likelihood values, see Sect. 4.3 in 
[14]. A more powerful and complex approach employs MCMC techniques able to 
jump between models of different dimensions, an approach known in the literature 
as reversible jump Markov chain Monte Carlo computation [10]. 

4.12 Further Topics and Advanced Reading


There is an extensive literature debating on the interpretation of probability as a quan- 
tification of personal belief and it would be impossible to give a satisfactory account 
of all the contributions. The reader interested in studying motivations and foundations 
of subjective probability may refer to [4, 16]. One of the merits of Bayesian probabil- 
ity is its efficacy in addressing ill-posed and ill-conditioned problems, including also 
a wide class of statistical learning problems. The connection between deterministic 
regularization and Bayesian estimation has been pointed out by several authors in dif- 
ferent contexts. Two examples related to spline approximation and neural networks 
are given by [8, 15]. 

The choice and tuning of the priors is undoubtedly the crux of any Bayesian 
approach. It is not a surprise that the tuning of hyperparameters via the Empirical 
Bayes approach emerged early as a practical and effective way to deploy Bayesian 
methods in real-world contexts, see [6] for its use in the study of the James-Stein
estimator. Since the 1980s, thanks to the advent of Markov chain Monte Carlo methods,
full Bayesian approaches have become a viable alternative, motivating reflections on 
the pros and cons of the two approaches, see, for instance, [17]. In particular, the 
connection between Stein’s Unbiased Risk Estimator (SURE), equivalent degrees 
of freedom and the robustness of marginal likelihood hyperparameter tuning has 
been investigated by [1, 21]. The choice of the prior distributions is somehow more 
controversial. In the present chapter, we exposed the principles of the maximum 
entropy approach, mainly following [12], but other approaches have been advocated 
for finding non-informative priors. A requirement could be invariance with respect 
to change of coordinates, enjoyed, for instance, by Jeffreys’ prior [13]. 

It is not unusual to have parameters that should be left immune from regularization.
In the Bayesian approach, this corresponds to the absence of prior information,
usually expressed through an infinite variance prior. Although the case could be
treated by assigning large variances to some parameters, it is numerically more robust
to use the exact formulas. Their derivation by a limit argument followed [22].

The idea of deriving approximated parametric models by a suitable projection of 
the Bayes estimate conforms to Hjalmarsson’s advice “always first model as well as 
possible” [11]. The projection result has been derived in [23] for Gaussian processes 
and subsequently extended to general distributions in [20]. 

The equivalent degrees of freedom of a regularized estimator have been studied in 
the context of smoothing by additive [2] and spline models [3, 9], while a discussion 
specialized to the case of Bayesian estimation can be found in [5, 17]. 

Starting from the seminal paper [7], the use of stochastic simulation for computing
posterior distributions according to a full Bayesian paradigm has gained a wider 
and wider adoption, especially when there exist conjugate priors that allow efficient 
sampling schemes. In particular, this is possible for the linear model discussed in 
this chapter, whose MCMC estimation is discussed in [18]. 

4.13 Appendix


4.13.1 Proof of Theorem 4.1 


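For simplicity, the proof is given in the scalar parameter case. We have that

$$\begin{aligned}
\frac{d\,\mathrm{MSE}(\hat{\theta})}{d\hat{\theta}} &= \frac{d}{d\hat{\theta}} \int_{-\infty}^{+\infty} (\hat{\theta} - \theta)^2 p(\theta|Y) \, d\theta \\
&= 2\hat{\theta} \int_{-\infty}^{+\infty} p(\theta|Y) \, d\theta - 2 \int_{-\infty}^{+\infty} \theta \, p(\theta|Y) \, d\theta \\
&= 2 \left( \hat{\theta} - \mathcal{E}[\theta|Y] \right).
\end{aligned}$$

Moreover,

$$\frac{d^2\,\mathrm{MSE}(\hat{\theta})}{d\hat{\theta}^2} = 2 \int_{-\infty}^{+\infty} p(\theta|Y) \, d\theta = 2 > 0.$$

Therefore, $\theta^B = \mathcal{E}[\theta|Y]$ minimizes $\mathrm{MSE}(\hat{\theta})$.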


4.13.2 Proof of Theorem 4.2 


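Let $X = \theta^B - \theta$ denote the estimation error. Recalling that $\mathcal{E}[Y - \Phi \mu_\theta] =
\mathcal{E}[E] = 0$, from (4.10) it follows that $\mathcal{E} X = 0$. Note also that $X$ and $Y$ are jointly
Gaussian and

$$\mathrm{Cov}(X, Y) = \mathcal{E}[X (Y - \mathcal{E} Y)^T] = \mathcal{E}[X Y^T] - \mathcal{E} X \, \mathcal{E} Y^T = \mathcal{E}[X Y^T].$$

Now, using (4.7), we have

$$\begin{aligned}
\mathcal{E}[X Y^T] &= \mathcal{E}[(\theta^B - \theta) Y^T] \\
&= \Sigma_{\theta Y} \Sigma_Y^{-1} \mathcal{E}[(Y - \mu_Y) Y^T] - \mathcal{E}[(\theta - \mu_\theta) Y^T] \\
&= \Sigma_{\theta Y} \Sigma_Y^{-1} \left( \mathcal{E}[Y Y^T] - \mu_Y \mu_Y^T \right) - \left( \mathcal{E}[\theta Y^T] - \mu_\theta \mu_Y^T \right) \\
&= \Sigma_{\theta Y} \Sigma_Y^{-1} \Sigma_Y - \Sigma_{\theta Y} = 0.
\end{aligned}$$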


4.13.3 Proof of Lemma 4.1 


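By applying the matrix inversion lemma (3.145) and proceeding with simple matrix
manipulations,

$$\begin{aligned}
\Sigma_\theta \Phi^T (\Phi \Sigma_\theta \Phi^T + \Sigma_E)^{-1} &= \Sigma_\theta \Phi^T \left( \Sigma_E^{-1} - \Sigma_E^{-1} \Phi (\Phi^T \Sigma_E^{-1} \Phi + \Sigma_\theta^{-1})^{-1} \Phi^T \Sigma_E^{-1} \right) \\
&= \Sigma_\theta \Phi^T \Sigma_E^{-1} - \Sigma_\theta \Phi^T \Sigma_E^{-1} \Phi (\Phi^T \Sigma_E^{-1} \Phi + \Sigma_\theta^{-1})^{-1} \Phi^T \Sigma_E^{-1} \\
&= \Sigma_\theta \left[ I - \Phi^T \Sigma_E^{-1} \Phi (\Phi^T \Sigma_E^{-1} \Phi + \Sigma_\theta^{-1})^{-1} \right] \Phi^T \Sigma_E^{-1} \\
&= \Sigma_\theta \left( \Phi^T \Sigma_E^{-1} \Phi + \Sigma_\theta^{-1} - \Phi^T \Sigma_E^{-1} \Phi \right) (\Phi^T \Sigma_E^{-1} \Phi + \Sigma_\theta^{-1})^{-1} \Phi^T \Sigma_E^{-1} \\
&= (\Phi^T \Sigma_E^{-1} \Phi + \Sigma_\theta^{-1})^{-1} \Phi^T \Sigma_E^{-1}.
\end{aligned}$$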


4.13.4 Proof of Theorem 4.3 


In view of (4.13), the conditional variance is 


p1p = = = 
Var(6|Y) = ( +5;') = (ooo i a ki }) l 


o2 


In view of (4.7) 
T T 27-1 RT T -1 
E(O|Y) = Sob (@LoP* +0°1,) Y= aut (aww +My Y. 


By replicating the passages of Lemma 4.1 


A —1 
aW(aww' + My! = (mws t) wt uM. 
a 


Moreover, by applying the matrix inversion lemma, see (3.145), 
L.\7! 
(aww? +My! = M7! -M'Y (uty + *r) yTM! 
a 
; dh ces = 
=M'—MmM ww uly)! (h + —(w mw) WTM. 
a 


Then, letting $a \to \infty$ completes the proof. Observe that all the inverse matrices
appearing in the proof exist due to the full-rank assumptions made on $\Phi$ and $\Psi$.


4.13.5 Proof of Theorem 4.6 


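The expectation in (4.41) can be rewritten as

$$\begin{aligned}
\mathcal{E}&\left[ \|\theta - \theta^B + \theta^B - g(\zeta)\|^2 \mid Y \right] \\
&= \mathcal{E}\left[ \|\theta - \theta^B\|^2 + 2 (\theta - \theta^B)^T (\theta^B - g(\zeta)) + \|\theta^B - g(\zeta)\|^2 \mid Y \right] \\
&= \mathcal{E}\left[ \|\theta - \theta^B\|^2 \mid Y \right] + \|\theta^B - g(\zeta)\|^2.
\end{aligned}$$

The proof follows by observing that in the last equation the first term does not depend
on $\zeta$. In the last passage, we have exploited the fact that $\theta^B | Y$ is deterministic and
equal to $\mathcal{E}(\theta|Y)$.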


4.13.6 Proof of Proposition 4.3 


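First observe that

$$\mathrm{WRSS} = \|\bar{\varepsilon}\|^2 = \sum_{i=1}^{N} \frac{\gamma^2 \bar{y}_i^2}{(\gamma + \bar{d}_i^2)^2}. \qquad (4.63)$$

Hence, in view of (4.52)

$$\mathcal{E}\,\mathrm{WRSS} = \sigma^2 \left( N - \mathrm{trace}\left( D (D^T D + \gamma I_n)^{-1} D^T \right) \right).$$

On the other hand, by simple matrix manipulations, it turns out that

$$U^T \Psi^{-1/2} H \Psi^{1/2} U = D (D^T D + \gamma I_n)^{-1} D^T.$$

Finally, recalling that $\mathrm{trace}(AB) = \mathrm{trace}(BA)$,

$$\mathrm{trace}(U^T \Psi^{-1/2} H \Psi^{1/2} U) = \mathrm{trace}(\Psi^{1/2} U U^T \Psi^{-1/2} H) = \mathrm{trace}(H)$$

thus proving the thesis.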


4.13.7 Proof of Theorem 4.8 


Without loss of generality, the proof refers to the diagonalized Bayesian estimation 
problem (4.48). The marginal loglikelihood function is 


N N 52 
log(d7A +07) +Y = — +k, 


where $\kappa$ denotes a constant we are not concerned with. By equating to zero the partial
derivatives with respect to $\sigma^2$ and $\lambda$ we obtain
N 1 N 5? 
= =— =0 
2 d?à + 0? 2 (d? + 0?) 


i=] i=l 


` d 3 Gy _ 4 
d?a +0? TOET 


i=1 i=l 


In view of (4.54) and (4.63), 


\[
\sigma^2\,(N - \mathrm{dof}(\gamma)) - \mathrm{WRSS} = 0, \qquad \lambda\,\mathrm{dof}(\gamma) - \mathrm{WPSS} = 0,
\]


which concludes the proof. 
Chapter 5
Regularization for Linear System Identification


Abstract Regularization has been intensively used in statistics and numerical analysis to stabilize the solution of ill-posed inverse problems. Its use in System Identification, instead, has been less systematic until very recently. This chapter provides an overview of the main motivations for using regularization in system identification from a “classical” (Mean Square Error) statistical perspective, also discussing how structural properties of dynamical models such as stability can be controlled via regularization. A Bayesian perspective is also provided, and the language of maximum entropy priors is exploited to connect different forms of regularization with time-domain and frequency-domain properties of dynamical systems. Some numerical examples illustrate the role of hyperparameters in controlling model complexity, for instance, quantified by the notion of Degrees of Freedom. A brief outlook on more advanced topics such as the connection with (orthogonal) basis expansions, McMillan degree and Hankel norms is also provided. The chapter is concluded with a historical overview of the early developments of the use of regularization in System Identification.


5.1 Preliminaries 


As we have discussed in the preceding chapters, system identification can be framed as an inverse problem which aims at finding a dynamical model $\mathcal{M}$ from a set of measured input-output “training” data $\mathcal{D}_T := \{u(t), y(t)\}_{t=1,\ldots,N}$. The field of inverse problems [5] has motivated the development of, and is pervaded by, regularization techniques; as such it is evident that regularization could and should play a major role also in the system identification arena.

On the contrary, we believe it is fair to say that regularization has not had a pervasive impact in system identification until very recently. To introduce its use in this field, we will refer to linear models $\mathcal{M} = \{\mathcal{M}(\theta) \mid \theta \in D_{\mathcal{M}}\}$ introduced in Chap. 2, Eq. (2.1). Note that this notation not only includes classical parametric structures, such as ARX, ARMAX and Box–Jenkins models, but also so-called nonparametric ones where the “parameter” $\theta$ may be infinite dimensional, e.g., containing all the impulse response coefficients of the filters $W_y(q)$ and $W_u(q)$ which characterize the predictor

\[
\hat{y}(t|\theta) = W_y(q)y(t) + W_u(q)u(t). \tag{5.1}
\]

The transfer functions $W_y(q)$ and $W_u(q)$ are related to the input–output model
\[
y(t) = G(q, \theta)u(t) + H(q, \theta)e(t)
\]
by the relation
\[
W_y(q) = [1 - H^{-1}(q, \theta)], \qquad W_u(q) = H^{-1}(q, \theta)G(q, \theta), \tag{5.2}
\]
see also (2.4).

For simplicity here, we consider the single-output case $y(t) \in \mathbb{R}$. In the prediction error framework described in Chap. 2, the model fit is typically measured by the negative log likelihood
\[
V_N(\theta) = -2\log p(\mathcal{D}_T|\theta) = -2\sum_{t=1}^{N}\log p\big(y(t) - \hat{y}(t|\theta)\big),
\]
which in the Gaussian case is, up to constants, proportional to the sum of squared prediction errors
\[
V_N(\theta) \propto \sum_{t=1}^{N}\big(y(t) - \hat{y}(t|\theta)\big)^2.
\]
As discussed in Chap. 3, regularization can be added to make the inverse problem of estimating the model $\mathcal{M}(\theta)$ from data well-posed, and therefore regularized estimators $\hat\theta^R$ of the form
\[
\hat\theta^R := \arg\min_\theta W_N(\theta) = \arg\min_\theta V_N(\theta) + J_\gamma(\theta) \tag{5.3}
\]
are considered. This framework has been extensively discussed in the previous chapter in the context of linear regression under the squared loss $V_N(\theta) = \|Y - \Phi\theta\|^2$, see e.g., Eq. (3.57).

The function $J_\gamma(\theta)$ is usually referred to as the penalty function, and possibly depends on some (hyper-)parameter $\gamma$. In the simplest case $J_\gamma(\theta)$ takes the multiplicative form
\[
J_\gamma(\theta) := \gamma J(\theta)
\]
and $\gamma$ acts as a scaling factor which controls the “amount” of regularization. The most famous example is the so-called ridge regression problem, in which a quadratic loss $V_N(\theta)$ is used and $J(\theta) := \|\theta\|^2$ so that (see also (3.61a)):
\[
\hat\theta^R := \arg\min_\theta \|Y - \Phi\theta\|^2 + \gamma\|\theta\|^2 = (\Phi^T\Phi + \gamma I)^{-1}\Phi^T Y.
\]
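The closed-form ridge solution can be tried out on a small FIR problem. The following is a minimal sketch, not taken from the book: the helper names `fir_regressor` and `ridge_estimate`, the test impulse response, and the convention that inputs before time 1 are zero are all assumptions made for illustration.

```python
import numpy as np

def fir_regressor(u, n):
    """Build the N x n matrix Phi whose row t is [u(t-1), ..., u(t-n)].
    Assumption: inputs before time 1 are zero."""
    N = len(u)
    Phi = np.zeros((N, n))
    for k in range(1, n + 1):
        Phi[k:, k - 1] = u[:N - k]
    return Phi

def ridge_estimate(Phi, Y, gamma):
    """Ridge regression: (Phi^T Phi + gamma I)^{-1} Phi^T Y."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)

rng = np.random.default_rng(0)
n, N = 20, 200
g = 0.8 ** np.arange(1, n + 1)          # hypothetical true impulse response
u = rng.standard_normal(N)              # white input
Phi = fir_regressor(u, n)
Y = Phi @ g + 0.1 * rng.standard_normal(N)
theta_ridge = ridge_estimate(Phi, Y, gamma=1.0)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse; the estimate satisfies the stationarity condition $\Phi^T(Y - \Phi\hat\theta^R) = \gamma\hat\theta^R$.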


However, ridge regression has not had a significant impact in the context of System Identification, i.e., when the vector $\theta$ contains the impulse response coefficients of a (linear) dynamical system. To understand why, it is important to discuss the choice of $J_\gamma(\theta)$. We will see that it plays a fundamental role and strongly influences the properties of the estimator $\hat\theta^R$. In particular, we will see how $J_\gamma(\theta)$ should be designed to encode properties of dynamical systems such as BIBO stability, smoothness in time domain and frequency domain, oscillatory behaviour and so on; this is a form of “inductive bias” well known and studied in the machine learning community, see e.g., [61].

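As argued in Chap. 4, regularization can be given a Bayesian interpretation. In fact, introducing a probabilistic prior on model parameters $\theta$ of the form
\[
p_\gamma(\theta) \propto e^{-\frac{J_\gamma(\theta)}{2}} \tag{5.4}
\]
and the likelihood function
\[
p(\mathcal{D}_T|\theta) \propto e^{-\frac{V_N(\theta)}{2}}, \tag{5.5}
\]
the maximum a posteriori (MAP) estimator of $\theta$ (see (4.2)) becomes
\begin{align}
\hat\theta^{MAP} &:= \arg\max_\theta\, p(\theta|\mathcal{D}_T) \tag{5.6} \\
&= \arg\max_\theta\, p(\mathcal{D}_T|\theta)\,p_\gamma(\theta) \tag{5.7} \\
&= \arg\max_\theta\, \log\left[p(\mathcal{D}_T|\theta)\,p_\gamma(\theta)\right] \tag{5.8} \\
&= \arg\min_\theta\, -\log\left[p(\mathcal{D}_T|\theta)\right] - \log\left[p_\gamma(\theta)\right] \tag{5.9} \\
&= \arg\min_\theta\, V_N(\theta) + J_\gamma(\theta) \tag{5.10} \\
&= \hat\theta^R. \tag{5.11}
\end{align}

In what follows, we will therefore use interchangeably the “regularization” framework, and thus think of $J_\gamma(\theta)$ as a penalty function, or the “Bayesian” framework, and thus think of $p_\gamma(\theta)$ as a prior (with some caution in the infinite-dimensional case).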


5.2 MSE and Regularization 


The final goal of modelling is to perform some task, e.g., prediction or control, on future unseen data. As such, the quality of the estimated model should be measured with this objective in mind. For simplicity, we will consider a prediction task, referring the reader to the literature discussed in Sect. 5.9 for extensions. To this purpose, in addition to the training data $\mathcal{D}_T$, let us introduce testing data:
\[
\mathcal{D}_{test} := \{u_{test}(t), y_{test}(t)\}_{t=1,\ldots,N_{test}}.
\]
A model $\hat{\mathcal{M}} := \mathcal{M}(\hat\theta)$ estimated using the training data $\mathcal{D}_T$ should then predict well the testing data $\mathcal{D}_{test}$. In particular, let $\hat{y}_{test}(t|\hat\theta)$ be the output prediction at instant $t$ constructed using the estimated model. Then, we can measure the performance of $\hat{\mathcal{M}}$ using the Mean Squared Error (MSE) on output ($y$) prediction, assuming that data are generated by some “true”, yet unknown, parameter vector $\theta_0$. This is defined as
\[
MSE_y(\hat{\mathcal{M}}, \theta_0) = \mathcal{E}\,\frac{1}{N_{test}}\sum_{t=1}^{N_{test}}\big(y_{test}(t) - \hat{y}_{test}(t|\hat\theta)\big)^2 = \mathcal{E}\big(y_{test}(t) - \hat{y}_{test}(t|\hat\theta)\big)^2, \tag{5.12}
\]
where, for simplicity, we have assumed stationary statistics for the couples $u_{test}(t)$, $y_{test}(t)$ in the last passage. In this section, we will argue that using regularization in estimating $\theta$ can indeed help in obtaining a small $MSE_y(\hat{\mathcal{M}}, \theta_0)$. Let us first assume that data are generated by an unknown “true” linear time-invariant (LTI) causal model:


\[
y(t) = \sum_{k=1}^{\infty} g_k u(t-k) + e(t), \tag{5.13}
\]
where the “true parameter” $\theta_0 = [g_1, g_2, g_3, \ldots, g_n, \ldots]$ is an infinite sequence in $\ell_1$, i.e.,
\[
\sum_{k=1}^{\infty} |g_k| < \infty.
\]


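We now consider the model class $\mathcal{M}(\theta)$ of Finite Impulse Response (FIR) Output Error (OE) models
\[
y(t) = \sum_{k=1}^{n} g_k u(t-k) + e(t), \tag{5.14}
\]
where the parameter vector $\theta \in \mathbb{R}^n$ contains the coefficients of an $n$th-order finite impulse response model. Under the assumption that the input process is unit variance white noise, independent of the measurement noise, and defining
\[
\hat{g}_k := \begin{cases} \hat\theta_k & k = 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}
\]
the MSE (5.12) has the expression
\[
\begin{aligned}
MSE_y(\hat{\mathcal{M}}, \theta_0) &= \mathcal{E}\big(y_{test}(t) - \hat{y}_{test}(t|\hat\theta)\big)^2 \\
&= \mathcal{E}\Big(\sum_{k=1}^{\infty}(g_k - \hat{g}_k)u_{test}(t-k) + e(t)\Big)^2 \\
&= \underbrace{\sum_{k=1}^{\infty}\mathcal{E}(g_k - \hat{g}_k)^2}_{\mathcal{E}\|g - \hat{g}\|^2} + \sigma^2 \\
&= \underbrace{\sum_{k=1}^{\infty}\mathcal{E}\big(\hat{g}_k - \mathcal{E}[\hat{g}_k]\big)^2}_{\text{Variance}} + \underbrace{\sum_{k=1}^{\infty}\big(g_k - \mathcal{E}[\hat{g}_k]\big)^2}_{\text{Bias}^2} + \sigma^2 \\
&= \underbrace{\sum_{k=1}^{n}\mathcal{E}\big(\hat{g}_k - \mathcal{E}[\hat{g}_k]\big)^2}_{\text{Variance}} + \underbrace{\sum_{k=1}^{n}\big(g_k - \mathcal{E}[\hat{g}_k]\big)^2 + \sum_{k=n+1}^{\infty} g_k^2}_{\text{Bias}^2} + \sigma^2.
\end{aligned} \tag{5.15}
\]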
This is nothing but the usual bias-variance trade-off discussed in Chap. 1: the model ($\theta$ in this case) has to be rich enough (i.e., $n$ large) to capture the “true” data generating mechanism (low bias) but also simple enough (i.e., $n$ small) to be estimated using the available data with low variability (low variance). The squared loss
\[
\mathcal{E}\|g - \hat{g}\|^2 = \sum_{k=1}^{\infty}\mathcal{E}(g_k - \hat{g}_k)^2
\]
present on the right-hand side of (5.15), after the third equality, is called a compound loss on the (possibly infinite) vector $\theta$ [60, 63] and defines the MSE.
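To make the trade-off in (5.15) concrete, the sketch below evaluates an approximate MSE as a function of the FIR order $n$. It rests on assumptions that are not in the text: the impulse response $g_k = 0.8^k$ is hypothetical, the estimator is plain least squares, and the variance of each estimated coefficient is approximated by $\sigma^2/N$ (reasonable for white unit-variance input and $N \gg n$), so that only the truncated tail contributes to the bias.

```python
import numpy as np

def approx_mse(g, n, sigma2, N):
    """Approximate output MSE (5.15) for an order-n FIR model fitted by least squares.
    Assumptions: variance ~ n * sigma2 / N, bias only from the truncated tail g_{n+1}, g_{n+2}, ..."""
    variance = n * sigma2 / N
    bias2 = np.sum(g[n:] ** 2)          # g[n:] holds g_{n+1}, g_{n+2}, ...
    return variance + bias2 + sigma2

g = 0.8 ** np.arange(1, 201)            # hypothetical impulse response, truncated at k = 200
orders = np.arange(1, 51)
mse = np.array([approx_mse(g, n, sigma2=1.0, N=100) for n in orders])
best_n = int(orders[np.argmin(mse)])
print("order minimizing the approximate MSE:", best_n)
```

Under these assumptions the minimizing order lies strictly inside the grid: small $n$ is penalized by the squared bias of the neglected tail, large $n$ by the variance term growing linearly in $n$.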

Considering compound losses of this type allows us to connect with the discussion made in Chap. 1 on Stein’s effect. To simplify exposition, let us assume that the identification input is a discrete impulse $u(t) = \delta(t)$ so that we can think of $y(t)$ as direct noisy measurements of all the (nonzero) impulse response coefficients
\[
y(t) = g_t + e(t), \qquad t = 1, \ldots, n. \tag{5.16}
\]
Defining $Y := [y(1), \ldots, y(n)]^T$ and $E := [e(1), \ldots, e(n)]^T$ the measurement model (5.16) can be written in vector form
\[
Y = \theta + E, \qquad E \sim \mathcal{N}(0, \sigma^2 I_n). \tag{5.17}
\]
As we have seen in Chap. 1, the least squares estimator $\hat\theta_{LS}$ for model (5.17) is dominated (for $n > 2$) by the James–Stein estimator discussed in Sect. 1.1.1. As argued in Chap. 1, the James–Stein estimator (1.3) is a special case of a regularized estimator (5.3) where $J_\gamma(\theta) = \gamma\|\theta\|^2$ and $\gamma$ takes the data-dependent form (1.4)
\[
\gamma = \frac{(n-2)\sigma^2}{\|Y\|^2 - (n-2)\sigma^2}.
\]
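The equivalence between James–Stein shrinkage and ridge regression with a data-dependent $\gamma$ can be checked numerically. This is a minimal sketch for the model (5.17); the function names and the test vector `theta0` are hypothetical, and the James–Stein estimator is written in its standard shrinkage form $\big(1 - (n-2)\sigma^2/\|Y\|^2\big)Y$.

```python
import numpy as np

def james_stein(Y, sigma2):
    """James-Stein shrinkage towards the origin for Y = theta + E."""
    n = Y.size
    return (1.0 - (n - 2) * sigma2 / np.sum(Y ** 2)) * Y

def ridge_with_js_gamma(Y, sigma2):
    """Ridge estimate argmin ||Y - theta||^2 + gamma ||theta||^2 = Y / (1 + gamma),
    with gamma = (n-2) sigma2 / (||Y||^2 - (n-2) sigma2)."""
    n = Y.size
    c = (n - 2) * sigma2
    gamma = c / (np.sum(Y ** 2) - c)
    return Y / (1.0 + gamma)

rng = np.random.default_rng(1)
theta0 = 0.8 ** np.arange(1, 11)        # hypothetical "true" coefficients
Y = theta0 + rng.standard_normal(10)    # sigma2 = 1
theta_js = james_stein(Y, sigma2=1.0)
theta_ridge = ridge_with_js_gamma(Y, sigma2=1.0)
```

The two estimates coincide because $1/(1+\gamma) = 1 - (n-2)\sigma^2/\|Y\|^2$ for this choice of $\gamma$.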
Following this route, the James–Stein estimator favours “small” parameter values (the regularization term $J_\gamma(\theta) = \gamma\|\theta\|^2$ penalises large $\|\theta\|$) and therefore it is to be expected that the gap w.r.t. the least squares estimator is larger under these circumstances; this has been illustrated in Fig. 1.1.

As pointed out in Sect. 1.1.2, there is actually nothing special in having chosen the origin as a reference. In fact, the penalty term can be replaced with $J_\gamma(\theta) = \gamma\|\theta - a\|^2$ for any $a \in \mathbb{R}^n$, yielding estimators which always dominate least squares provided $\gamma$ is chosen as
\[
\gamma = \frac{(n-2)\sigma^2}{\|Y - a\|^2 - (n-2)\sigma^2}.
\]


This teaches us that under certain circumstances it is possible to steer the estimators, using a suitable penalty functional, towards certain regions of the parameter space (or more generally model space); most importantly, this can be done without any loss (actually with a gain) for any possible occurrence of the “true” yet unknown system. However, the reader should keep in mind that this only holds for the compound loss (5.15) and should not be seen as a panacea. For instance, James–Stein estimators may provide only marginal improvements over least squares in situations where the signal-to-noise ratio is highly non-uniform over parameter space, a situation often encountered in system identification when input signals are not white and poor excitation may be present, e.g., in certain frequency bands. This has been illustrated in Example 1.2.

Therefore, as a take-home message from Chap. 1 and the discussion above, we should remember that regularization has much to offer, yet its use in system identification is not straightforward. The main reasons are as follows:


1. Often one cannot restrict to Output Error models (i.e., also noise models should be included) and the input process is neither impulsive nor white. Thus, the MSE (5.12) takes a different form than (5.15). This calls for extensions of James–Stein estimators to weighted losses and non-orthogonal designs; to some extent this has been pursued in the statistics literature, and the reader is referred to [4, 9, 43, 64] and references therein. See also [13, Sect. 6].

2. While James–Stein estimators have been built with the purpose of showing that the least squares estimator is not admissible (see Sect. 1.1.1 for a formal definition), it may not necessarily be our primary goal to dominate least squares (or another estimator) uniformly over parameter space. In order to cure the ill-conditioning phenomenon widely discussed in Chap. 3, it could be advantageous to tailor regularization to certain “dynamical-system” oriented properties, thus gaining a lot in certain regions of the model space, while possibly incurring minor losses in other regions which are very unlikely.


The latter is one of the main goals of this book, i.e., to provide the reader with a 
thorough understanding of the role of regularization in estimating dynamical systems 
so as to optimally design regularization methods depending on the intended use of 
the model. In the remaining part of the chapter, we will first introduce the concept 
of “optimal” prior and derive its expression. We will then connect the structure 
of the optimal prior to the notion of BIBO stability for linear dynamical systems 
and also its link with smoothness in time and frequency domains. Connection with 
the Bayesian setting will also be provided. The chapter will be concluded with an 
historical overview of how the use of regularization in the context of estimation 
of dynamical systems has evolved, illustrating also the role played by time- and 
frequency-domain smoothness. 


5.3 Optimal Regularization for FIR Models 


Consider the FIR model (5.14), which can be written in vector form as
\[
Y = \Phi\theta + E, \tag{5.18}
\]
where $Y := [y(1), \ldots, y(N)]^T$, $E := [e(1), \ldots, e(N)]^T$ and $\Phi$ contains the input samples, which are assumed to be available for all times needed to avoid issues related to the initial condition. Then, we will still use $\theta_0$ to denote the “true” value that has generated the data.

We now consider the class of regularized estimators
\[
\hat\theta^R := \arg\min_{\theta \in \mathbb{R}^n} \|Y - \Phi\theta\|^2 + \sigma^2\theta^T P^{-1}\theta
\]
parametrized by the regularization matrix $P = P^T > 0$. As shown in Chap. 3, see Eq. (3.60), the generalized ridge regression estimator $\hat\theta^R$ can be extended also to the case where $P$ is singular, so that we can assume $P = P^T \geq 0$. As a matter of fact, in the Bayesian framework introduced in Chap. 4, $\hat\theta^R$ can also be interpreted as the MAP estimator
\[
\hat\theta^{MAP} := \arg\max_\theta\, p(\theta|Y)
\]
obtained under the assumption that the noise $E$ is Gaussian, zero mean and variance $\sigma^2 I$ and that $\theta$ is independent of $E$, zero-mean Gaussian with (possibly singular) variance $P = P^T \geq 0$ (the singular case was described in (4.19)).

In this section, to emphasize the dependence of the estimator $\hat\theta^R$ on $P = P^T \geq 0$, we will use the notation
\[
\hat\theta^P := \hat\theta^R = \hat\theta^{MAP}.
\]
Our objective now is to study the performance of the estimator $\hat\theta^P$, in terms of MSE, as a function of $P = P^T \geq 0$, under the assumption that $Y$ has been generated by a “true model” of the form (5.18) with a deterministic and unknown parameter $\theta_0$. Thus, the only sources of “randomness” are the noise vector $E$ and the system input, which is seen as a stochastic process (independent of $E$) in this section.
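A form of the estimator that never inverts $P$ is $\hat\theta^P = P\Phi^T(\Phi P\Phi^T + \sigma^2 I_N)^{-1}Y$, which follows from the identity used in the proof of Lemma 4.1 and remains well defined for singular $P$. The sketch below is illustrative only: the function name `regularized_fir` and the random test data are assumptions.

```python
import numpy as np

def regularized_fir(Phi, Y, P, sigma2):
    """theta_hat^P = P Phi^T (Phi P Phi^T + sigma2 I)^{-1} Y.
    No inverse of P is needed, so P may be singular (P >= 0)."""
    N = Phi.shape[0]
    return P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + sigma2 * np.eye(N), Y)

rng = np.random.default_rng(2)
N, n, sigma2 = 30, 5, 0.5
Phi = rng.standard_normal((N, n))
Y = rng.standard_normal(N)

# Nonsingular P: compare with the generalized ridge form (Phi^T Phi + sigma2 P^{-1})^{-1} Phi^T Y
A = rng.standard_normal((n, n))
P = A @ A.T + 0.1 * np.eye(n)
theta_a = regularized_fir(Phi, Y, P, sigma2)
theta_b = np.linalg.solve(Phi.T @ Phi + sigma2 * np.linalg.inv(P), Phi.T @ Y)

# Singular (rank-one) P = v v^T: the estimate is confined to the span of v
v = rng.standard_normal(n)
theta_c = regularized_fir(Phi, Y, np.outer(v, v), sigma2)
```

With a rank-one $P = vv^T$ the estimate is a scalar multiple of $v$, which illustrates how a singular regularization matrix restricts the estimate to the range of $P$.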
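We consider a test experiment with a new input $u_{test}(t)$, independent of the input $u(t)$ used for identification; for convenience of notation, we define the lagged test input vector
\[
\varphi_{test}(t) := [u_{test}(t), \ldots, u_{test}(t-n+1)]^T
\]
so that under (5.14) the test output is given by
\[
y_{test}(t) = \varphi_{test}^T(t)\theta_0 + e_{test}(t).
\]
Let us also define the covariance matrix
\[
W_u := \mathrm{Var}\{\varphi_{test}(t)\} = \mathcal{E}\,\varphi_{test}(t)\varphi_{test}^T(t)
\]
(note that stationarity assumptions are present here, in fact $W_u$ does not depend on time $t$) and the MSE matrix
\[
M_{\theta_0}(P) := \mathcal{E}\,(\theta_0 - \hat\theta^P)(\theta_0 - \hat\theta^P)^T.
\]
If we now consider the output mean squared error $MSE_y(\hat{\mathcal{M}}, \theta_0)$ in (5.12) computed for the model $\hat{\mathcal{M}}$, we obtain
\[
\begin{aligned}
MSE_y(\hat{\mathcal{M}}, \theta_0) &= \mathcal{E}\big(y_{test}(t) - \hat{y}_{test}(t|\hat\theta^P)\big)^2 \\
&= \mathcal{E}\big[\varphi_{test}^T(t)\theta_0 + e_{test}(t) - \varphi_{test}^T(t)\hat\theta^P\big]^2 \\
&= \mathcal{E}\big[(\theta_0 - \hat\theta^P)^T\varphi_{test}(t)\varphi_{test}^T(t)(\theta_0 - \hat\theta^P)\big] + \sigma^2 \\
&= \mathrm{Tr}\big\{\mathcal{E}\,(\theta_0 - \hat\theta^P)(\theta_0 - \hat\theta^P)^T\;\mathcal{E}\,\varphi_{test}(t)\varphi_{test}^T(t)\big\} + \sigma^2 \\
&= \mathrm{Tr}\{M_{\theta_0}(P)\,W_u\} + \sigma^2,
\end{aligned} \tag{5.19}
\]
where in the second to last equation, we have used that the test inputs and noises are assumed to be independent of the training inputs and noise in the identification data used for estimating $\hat\theta^P$.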

A direct consequence of this fact is that, given two prior covariance matrices $P$ and $P^*$, if $M_{\theta_0}(P) \geq M_{\theta_0}(P^*)$, then
\[
MSE_y(\hat\theta^P, \theta_0) \geq MSE_y(\hat\theta^{P^*}, \theta_0) \qquad \forall\, W_u,
\]
i.e., estimator $\hat\theta^{P^*}$ outperforms $\hat\theta^P$ in terms of output prediction for any possible choice of the test input covariance $W_u$. Thus, if the modelling purpose is output prediction, it is of interest to minimize, w.r.t. all possible $P = P^T \geq 0$, the matrix $M_{\theta_0}(P)$, i.e., to find
\[
P^* := \arg\min_{P = P^T \geq 0} M_{\theta_0}(P), \tag{5.20}
\]
so that $\hat\theta^{P^*}$ outperforms any other $\hat\theta^P$ in terms of output error (5.15) for any choice of the (test) input covariance $W_u$. Under the assumption that the true model generating the data is an FIR model of length $n$ with impulse response
\[
g_k = \begin{cases} \theta_{0,k} & k \leq n \\ 0 & k > n, \end{cases}
\]
the solution $P^*$ of the minimization problem in (5.20) has been derived in Proposition 3.1, and takes the form
\[
P^* = \theta_0\theta_0^T, \tag{5.21}
\]


where $\theta_0$ is the “true” impulse response of the data-generating mechanism (5.14). An alternative proof of the optimal solution (5.21) to problem (5.20) can be found in Sect. 5.10.1. Since $P^*$ depends on the unknown true system, this result is not of practical interest; however, if we think of the FIR model (5.14) as the approximation of a BIBO stable infinite impulse response model
\[
y(t) = \sum_{k=1}^{\infty}\theta_{0,k}\,u(t-k) + e(t), \tag{5.22}
\]
the impulse response $\theta_0$ should have finite $\ell_1$ norm $\|\theta_0\|_1$, i.e.,
\[
\|\theta_0\|_1 := \sum_{k=1}^{\infty}|\theta_{0,k}| < \infty, \tag{5.23}
\]
and therefore $\theta_{0,k}$ should decay as a function of the index $k$. As a result, the entries $[P^*]_{ij} = \theta_{0,i}\theta_{0,j}$ of the optimal kernel decay as functions of the row and column indexes $i$ and $j$. In Bayesian terms, it is thus expected that also the elements $[P]_{ij}$ of any “good” candidate prior variance should do the same. As we will see later in this chapter, recent forms of regularization for system identification include a decay rate condition on the elements $[P]_{ij}$, so as to guarantee that the estimated system is BIBO stable. Therefore, we will often refer to conditions on the decay rate of $P$ as “stability conditions”. While condition (5.23) is obviously satisfied when $\theta$ is a finite dimensional vector, this loose connection between decay rate of the kernel and stability needs to be tightened. We will see in the next section that this can be properly formulated in a Bayesian framework.
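The optimality of $P^* = \theta_0\theta_0^T$ in (5.21) can be probed numerically. The sketch below is an illustration under stated assumptions: the regression matrix $\Phi$ is held fixed (so expectations are taken over the noise only), and the closed-form MSE matrix used in `mse_matrix` is derived here from $\hat\theta^P = (PR + \sigma^2 I)^{-1}P\Phi^TY$ with $R = \Phi^T\Phi$, giving bias $-\sigma^2(PR + \sigma^2 I)^{-1}\theta_0$ and covariance $\sigma^2(PR + \sigma^2 I)^{-1}PRP(RP + \sigma^2 I)^{-1}$. The names and test values are hypothetical.

```python
import numpy as np

def mse_matrix(P, R, theta0, sigma2):
    """MSE matrix of theta_hat^P for fixed regressors, with R = Phi^T Phi:
    M = S (sigma2^2 theta0 theta0^T + sigma2 P R P) S^T,  S = (P R + sigma2 I)^{-1}."""
    n = R.shape[0]
    S = np.linalg.inv(P @ R + sigma2 * np.eye(n))
    return S @ (sigma2 ** 2 * np.outer(theta0, theta0) + sigma2 * P @ R @ P) @ S.T

rng = np.random.default_rng(3)
n, N, sigma2 = 6, 40, 1.0
theta0 = 0.7 ** np.arange(1, n + 1)     # hypothetical true impulse response
Phi = rng.standard_normal((N, n))
R = Phi.T @ Phi

M_star = mse_matrix(np.outer(theta0, theta0), R, theta0, sigma2)

# For random P >= 0, the difference M(P) - M(P*) should be positive semidefinite
min_eigs = []
for _ in range(100):
    A = rng.standard_normal((n, n))
    D = mse_matrix(A @ A.T, R, theta0, sigma2) - M_star
    min_eigs.append(np.linalg.eigvalsh((D + D.T) / 2).min())
worst = min(min_eigs)
print("smallest eigenvalue of M(P) - M(P*) over all trials:", worst)
```

Up to rounding, the smallest eigenvalue stays nonnegative for every trial, in line with the matrix inequality $M_{\theta_0}(P) \geq M_{\theta_0}(P^*)$.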


5.4 Bayesian Formulation and BIBO Stability 


In the previous section, we have considered only FIR models, which are reasonable approximations of any BIBO stable LTI system in most practical scenarios. However, it is of interest to formulate the estimation of LTI BIBO stable systems in full generality, without assuming the impulse response to be of finite support. This entails working with infinite dimensional impulse responses $\{\theta_k\}_{k \in \mathbb{N}}$. In this chapter, we first consider the Bayesian framework, while regularization in infinite-dimensional Hilbert spaces will be addressed in Chap. 6. To start with, we model the unknown impulse response $\{\theta_k\}_{k \in \mathbb{N}}$ as a stochastic process indexed over time $k$; this is the straightforward extension to the infinite-dimensional case of (5.18) where $\theta$ was a finite-dimensional random vector. In this context, it is of interest to introduce the concept of “stable” priors:


Definition 5.1 (Stable priors) A prior on $\{\theta_k\}_{k \in \mathbb{N}}$ is said to be stable if realizations are sequences almost surely in $\ell_1$, i.e.,
\[
\sum_{k=1}^{\infty}|\theta_k| < \infty \quad \text{a.s.}
\]

In most of this book, mostly for computational reasons, we will also assume that $\{\theta_k\}_{k \in \mathbb{N}}$ is Gaussian (i.e., that any finite collection of random variables $\{\theta_k\}_{k \in I}$, $I = \{i_1, \ldots, i_\ell\}$, $i_j \in \mathbb{N}$, $\ell \in \mathbb{N}$, is jointly Gaussian). This is formalized in the following assumption.


Assumption 5.1 Under the Bayesian framework, we assume $\{\theta_k\}_{k \in \mathbb{N}}$ to be a Gaussian stochastic process with mean $\{m_k\}_{k \in \mathbb{N}}$ and covariance function $K(t, s)$, $t, s \in \mathbb{N}$.


It is an interesting fact that, under additional assumptions on the mean and covariance functions, the prior is stable according to Definition 5.1, as formalized in the following lemma whose proof is in Sect. 5.10.2.


Lemma 5.1 Under Assumption 5.1 and if the following additional conditions hold
\[
\sum_{k=1}^{\infty}|m_k| = M_{\ell_1} < \infty, \qquad \sum_{k=1}^{\infty}K(k, k)^{1/2} = K_{\ell_1} < \infty, \tag{5.24}
\]
then the prior is stable as per Definition 5.1, i.e.,
\[
\sum_{k=1}^{\infty}|\theta_k| < \infty \quad \text{a.s.}
\]

In most of this book, we will also make the assumption that the a priori mean $m_k$ is identically zero, and thus only the condition on the covariance $K(t, s)$ should be checked to ensure stability. We will now discuss different forms of prior covariances $K$ encountered in the literature.
5.5 Smoothness and Contractivity: Time- and
Frequency-Domain Interpretations


As seen in Sect. 5.3, the optimal regularizer should mimic the “true” impulse response, which is clearly unfeasible since the impulse response is unknown. However, as already discussed in Sect. 5.4, we can use the prior to encode qualitative behaviour of impulse responses of BIBO stable linear systems. In particular we have seen in Lemma 5.1 that a certain decay condition on the prior mean and covariance guarantees the description of only (almost surely) BIBO stable linear systems. The simplest example of such a prior model is the following.


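Example 5.2 (Diagonal (DI) prior) Assume the prior mean to be zero, $m_t = 0$ $\forall t \in \mathbb{N}$, and the covariance function to be diagonal with exponentially decaying entries
\[
K(t, s) = \lambda\alpha^t\delta(t - s), \qquad t, s \in \mathbb{N}, \quad \lambda \geq 0, \quad 0 \leq \alpha < 1.
\]
The parameters $\lambda$ (scale factor) and $\alpha$ (decay rate) are treated as hyperparameters to be estimated from data, using e.g., marginal likelihood maximization, as described in Sect. 4.4. It is worth observing that the assumptions of Lemma 5.1 are satisfied, indeed
\[
\sum_{t \in \mathbb{N}}|m_t| = 0 < \infty, \qquad \sum_{t \in \mathbb{N}}K(t, t)^{1/2} = \sum_{t \in \mathbb{N}}\sqrt{\lambda}\,\alpha^{t/2} = \frac{\sqrt{\lambda}\sqrt{\alpha}}{1 - \sqrt{\alpha}} < \infty,
\]
and hence this is a stable prior.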
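The stability condition of Lemma 5.1 for the DI prior can be checked numerically, and a few realizations can be drawn to inspect their $\ell_1$ norms. This is a minimal sketch: the hyperparameter values, the truncation length `n` and the helper name `di_kernel` are assumptions chosen for illustration, with time indices starting at $t = 1$.

```python
import numpy as np

def di_kernel(n, lam, alpha):
    """Diagonal (DI) kernel K(t, s) = lam * alpha^t * delta(t - s), t, s = 1..n."""
    t = np.arange(1, n + 1)
    return np.diag(lam * alpha ** t)

lam, alpha, n = 2.0, 0.8, 400
K = di_kernel(n, lam, alpha)

# Partial sum of K(t,t)^{1/2} versus the geometric-series limit
partial_sum = np.sum(np.sqrt(np.diag(K)))
closed_form = np.sqrt(lam) * np.sqrt(alpha) / (1 - np.sqrt(alpha))

# Independent zero-mean Gaussian coefficients with variances K(t,t)
rng = np.random.default_rng(4)
theta = rng.standard_normal((5, n)) * np.sqrt(np.diag(K))
l1_norms = np.abs(theta).sum(axis=1)
print(partial_sum, closed_form, l1_norms)
```

Since the entries are independent under this prior, sampling reduces to scaling standard normal draws by the square roots of the diagonal.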


It is interesting to observe that a decay rate condition on the impulse response coefficients is equivalent to assuming a smoothness condition in the frequency domain. To see this, let us introduce the frequency response function
\[
G(e^{i\omega}) := \sum_{k=1}^{\infty}\theta_k e^{-i\omega k}.
\]
The $L_2$-norm of the first derivative $\frac{dG(e^{i\omega})}{d\omega}$ can be considered
\[
\left\|\frac{dG(e^{i\omega})}{d\omega}\right\|^2 = \frac{1}{2\pi}\int_0^{2\pi}\left|\frac{dG(e^{i\omega})}{d\omega}\right|^2 d\omega,
\]
which using Parseval’s theorem can be expressed in time domain
\[
\left\|\frac{dG(e^{i\omega})}{d\omega}\right\|^2 = \sum_{k=1}^{\infty}k^2|\theta_k|^2. \tag{5.25}
\]

Fig. 5.1 Sample realizations from the diagonal kernel prior for $\alpha = 0.4$ (top) and $\alpha = 0.8$ (bottom). Impulse response is on the left, frequency response (magnitude only) on the right

Computing higher-order derivatives, and using again Parseval’s theorem, the $L_2$-norm of the $m$th-order derivative is given by
\[
\left\|\frac{d^m G(e^{i\omega})}{d\omega^m}\right\|^2 = \sum_{k=1}^{\infty}k^{2m}|\theta_k|^2. \tag{5.26}
\]
Hence, the condition that the $\{\theta_k\}$ decay rapidly (and possibly exponentially as postulated by the Diagonal kernel) with $k$ implies a bound on the $L_2$ norm of the $m$th-order derivatives, i.e., smoothness in the frequency domain of the model.
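Relations (5.25) and (5.26) can be verified numerically for a finite impulse response. The sketch below assumes a hypothetical decaying sequence `theta` and approximates the frequency-domain integral by an average over a uniform grid, which is exact (up to rounding) for a trigonometric polynomial as long as the number of grid points exceeds the sequence length.

```python
import numpy as np

def derivative_norm_sq(theta, m, M=1024):
    """(1/2pi) * integral over [0, 2pi) of |d^m G / d omega^m|^2,
    computed as an average over M uniformly spaced frequencies."""
    k = np.arange(1, theta.size + 1)
    omega = 2 * np.pi * np.arange(M) / M
    coeffs = (-1j * k) ** m * theta                 # m-th derivative of theta_k e^{-i omega k}
    dG = np.exp(-1j * np.outer(omega, k)) @ coeffs
    return np.mean(np.abs(dG) ** 2)

k = np.arange(1, 41)
theta = 0.8 ** k * np.cos(0.3 * k)                  # hypothetical impulse response
freq_side = [derivative_norm_sq(theta, m) for m in (1, 2, 3)]
time_side = [np.sum(k ** (2 * m) * theta ** 2) for m in (1, 2, 3)]
print(freq_side, time_side)
```

The weights $k^{2m}$ on the time-domain side make explicit why fast decay of $\theta_k$ is needed to keep high-order derivatives of the frequency response bounded.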

As illustrated in Fig. 5.1, smoothness in the frequency domain decreases when $\alpha$ increases. However, under this prior, the impulse response coefficients are modelled as independent (yet not identically distributed) random variables. Thus no smoothness in the time domain is included, as is instead typically done with priors based on random walks, which are the discrete-time counterpart of spline models as discussed in Sect. 4.9. A prior model that, in addition to stability, also includes a smoothness condition in the time domain is the so-called TC kernel:


Example 5.3 (Tuned-Correlated (TC) prior) Assume the prior mean is zero, $m_t = 0$ $\forall t \in \mathbb{N}$, and the covariance function takes the form
\[
K(t, s) = \lambda\alpha^{\max(t, s)}, \qquad t, s \in \mathbb{N}, \quad \lambda \geq 0, \quad 0 \leq \alpha < 1.
\]
As in the previous example, the parameters $\lambda$ (scale factor) and $\alpha$ (decay rate) are treated as hyperparameters to be estimated from data, using e.g., marginal likelihood maximization. It is worth observing that also in this case the assumptions of Lemma 5.1 are satisfied, indeed
\[
\sum_{t \in \mathbb{N}}|m_t| = 0 < \infty, \qquad \sum_{t \in \mathbb{N}}K(t, t)^{1/2} = \sum_{t \in \mathbb{N}}\sqrt{\lambda}\,\alpha^{t/2} = \frac{\sqrt{\lambda}\sqrt{\alpha}}{1 - \sqrt{\alpha}} < \infty,
\]
and hence this is a stable prior. In addition, the TC prior now introduces correlation between impulse response coefficients, e.g., one has
\[
\rho_{\theta_t\theta_s} = \alpha^{\frac{t - s}{2}} \qquad \forall\, t \geq s.
\]
So, the correlation is different from zero and exponentially decays to zero.

Fig. 5.2 Sample realizations from the Tuned-Correlated (TC) prior for $\alpha = 0.4$ (top) and $\alpha = 0.8$ (bottom). Impulse response is on the left, frequency response (magnitude only) on the right
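The TC kernel and its correlation structure can be reproduced in a few lines. In this sketch the correlation coefficient is computed as $K(t,s)/\sqrt{K(t,t)K(s,s)}$ and compared with $\alpha^{|t-s|/2}$; the helper name `tc_kernel`, the hyperparameter values and the truncation length are assumptions, and realizations are drawn through a Cholesky factor of the finite kernel matrix.

```python
import numpy as np

def tc_kernel(n, lam, alpha):
    """Tuned-Correlated kernel K(t, s) = lam * alpha^max(t, s), t, s = 1..n."""
    t = np.arange(1, n + 1)
    return lam * alpha ** np.maximum.outer(t, t)

n, lam, alpha = 30, 1.0, 0.8
K = tc_kernel(n, lam, alpha)

# Correlation coefficients between impulse response coefficients
d = np.sqrt(np.diag(K))
corr = K / np.outer(d, d)
t = np.arange(1, n + 1)
expected = alpha ** (np.abs(np.subtract.outer(t, t)) / 2)

# Draw 5 zero-mean Gaussian realizations with covariance K
rng = np.random.default_rng(6)
L = np.linalg.cholesky(K)
samples = (L @ rng.standard_normal((n, 5))).T
```

Unlike the DI case, neighbouring coefficients are now positively correlated, which is what produces smoother realizations in the time domain.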


Figure 5.2 shows two typical realizations from the TC prior, both in time domain
and frequency domain, for $\alpha = 0.4$ (top) and $\alpha = 0.8$ (bottom), while Fig. 5.3 shows
30 sample realizations from the DI (top) and TC (bottom) priors, respectively.


Example 5.4 (Importance of stable priors) In order to illustrate the advantage of using stable priors, we now consider a simple example of identification of an output error model. In particular, we consider a system of the form
\[
y(t) = \sum_{k=1}^{\infty} g_k u(t-k) + e(t),
\]
where the measured input $u(t)$ and the noise $e(t)$ are realizations from white Gaussian noise with zero mean and unit variance. The impulse response is


For the purpose of identification, we assume the input is available at all time instants needed. For illustration purposes, the impulse response has been truncated at $k = 50$, since it is in practice zero for $k > 50$. We also assume that output measurements $y(t)$ are available for $t = 1, \ldots, 35$. The hyperparameters are all estimated using marginal likelihood maximization, see Sect. 4.4. The results are shown in Fig. 5.4. The reconstruction error is measured using the percentage root mean square (RMS) error:
\[
\sqrt{\frac{\sum_{k=1}^{\infty}(g_k - \hat{g}_k)^2}{\sum_{k=1}^{\infty}g_k^2}} \times 100\%. \tag{5.27}
\]


As illustrated in Fig. 5.4, it is apparent that the results obtained by using the stable priors, see panels (b) and (c), outperform those returned by the spline (random walk) prior, see panel (a), which does not include the stability constraint. The best relative error is obtained by the TC prior (~10%) and goes up to as much as ~33% for the spline prior. It can also be observed that while for stable priors (b) and (c) confidence intervals shrink as time index $k$ grows, the same does not hold for the spline prior. The same behaviour had been observed in Sect. 4.9, see Fig. 4.1.
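An experiment in the spirit of Example 5.4 can be sketched as follows. Several ingredients are assumptions and do not come from the text: the impulse response `g` is a hypothetical damped sinusoid, the TC hyperparameters are fixed by hand instead of being tuned by marginal likelihood, and the estimate uses the form $\hat\theta = K\Phi^T(\Phi K\Phi^T + \sigma^2 I)^{-1}Y$. Only the dimensions ($n = 50$ coefficients, $N = 35$ outputs, unit noise variance) follow the example.

```python
import numpy as np

def rms_error_percent(g_true, g_hat):
    """Percentage RMS error (5.27) between true and estimated impulse responses."""
    return 100.0 * np.sqrt(np.sum((g_true - g_hat) ** 2) / np.sum(g_true ** 2))

def tc_kernel(n, lam, alpha):
    k = np.arange(1, n + 1)
    return lam * alpha ** np.maximum.outer(k, k)

rng = np.random.default_rng(5)
n, N, sigma2 = 50, 35, 1.0
k = np.arange(1, n + 1)
g = 2.0 * 0.85 ** k * np.sin(0.5 * k)            # hypothetical impulse response (assumption)

u = rng.standard_normal(N + n)                   # input available at all needed past times
idx = np.arange(N)[:, None] - np.arange(n)[None, :] + n - 1
Phi = u[idx]                                     # row t holds u(t-1), ..., u(t-n)
Y = Phi @ g + np.sqrt(sigma2) * rng.standard_normal(N)

K = tc_kernel(n, lam=1.0, alpha=0.85)            # fixed hyperparameters (assumption)
g_hat = K @ Phi.T @ np.linalg.solve(Phi @ K @ Phi.T + sigma2 * np.eye(N), Y)
err = rms_error_percent(g, g_hat)
print("percentage RMS error:", err)
```

Note that with $N < n$ the unregularized least squares problem is underdetermined, whereas the regularized estimate remains well defined because only an $N \times N$ system is solved.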


In the next section, a class of stable priors, which includes TC as a special case, 
will be derived following a first-principle maximum entropy framework. 


5.5.1 Maximum Entropy Priors for Smoothness and 
Stability: From Splines to Dynamical Systems 


The class of Stable Spline priors introduced in the paper [49] extends the smoothness prior ideas used in the spline models introduced in Sect. 4.9, embedding exponential decay conditions on the impulse response prior. They ultimately lead to estimated models which are BIBO stable with probability 1.

In this section, we will introduce a simple construction of these stable spline priors in discrete time. In particular, we will exploit a very natural axiomatic derivation in the maximum entropy framework introduced in Chap. 4. For the sake of illustration, we will only consider the so-called stable spline prior of order one (also known as the TC prior, see Example 5.3) and its extension known as DC prior. Possible extensions will be discussed, but not developed in full detail.

Fig. 5.4 Panels a–c: impulse response reconstruction (blue) and true (red) with 95% Bayesian confidence intervals (dashed). Panel d is the relative RMS error (5.27) on impulse response reconstruction as a function of the scale factor $\lambda$. For DI and TC priors, for each scale factor the optimal decay rate $\alpha$ is estimated using marginal likelihood. The star denotes the performance obtained using the scale factor selected using marginal likelihood optimization. It is remarkable that the relative error achieved by maximizing the marginal likelihood is close to the minimum achievable by an oracle who would have access to the true impulse response and thus could minimize the relative RMS error

The most natural construction, inspired by smoothing spline ideas, is based on the following two observations:


1. Stability: the variance of $\theta_k$ should decay “sufficiently fast” (see Lemma 5.1), possibly exponentially, with the lag $k$. Assuming a zero-mean process, this can be expressed using a condition on second-order moments of the form:
\[
\mathcal{E}[\theta_k^2] = \lambda_S\alpha^k, \qquad k = 1, \ldots, n, \quad 0 < \alpha < 1. \tag{5.28}
\]
For reasons that will become clear later on, imposing equality (as done above) rather than inequality constraints is convenient.

2. Smoothness: the difference between adjacent coefficients should be constrained, e.g., as measured by the relative variance,
\[
\frac{\mathcal{E}[(\theta_{k-1} - \theta_k)^2]}{\mathcal{E}[\theta_{k-1}^2]} = \lambda_R, \qquad k = 2, \ldots, n. \tag{5.29}
\]
Using the stability constraint and redefining the constant $\lambda_R$, condition (5.29) can be rewritten as
\[
\mathcal{E}[(\theta_{k-1} - \theta_k)^2] = \lambda_R\alpha^{k-1}, \qquad k = 2, \ldots, n. \tag{5.30}
\]


The following theorem (whose proof is reported in Sect. 5.10.3) derives the class of maximum entropy priors under the constraints (5.28) and (5.29). Next, in Corollary 5.1 (whose proof is in Sect. 5.10.4), we will see that for special choices of $\lambda_S$ and $\lambda_R$ the well-known TC and DC priors [10, 52] are obtained.


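Theorem 5.5 Let $\{\theta_k\}_{k=1,\ldots,n}$ be a zero mean, absolutely continuous random vector with density $p_\theta(\theta)$, that satisfies the following constraints (with $0 < \alpha < 1$):
\[
\begin{aligned}
\mathcal{E}[\theta_k^2] &= \lambda_S\alpha^k, & k &= 1, \ldots, n \\
\mathcal{E}[(\theta_{k-1} - \theta_k)^2] &= \lambda_R\alpha^{k-1}, & k &= 2, \ldots, n
\end{aligned} \tag{5.31}
\]
with $\lambda_S \in \mathbb{R}$ and $\lambda_R \in \mathbb{R}$ such that
\[
\lambda_S(1 - \sqrt{\alpha})^2 < \lambda_R < \lambda_S(1 + \sqrt{\alpha})^2. \tag{5.32}
\]
Then, the solution $p_{\theta,ME}(\theta)$ of the maximum entropy problem
\[
p_{\theta,ME} := \arg\max_{p(\cdot)\ \text{s.t.}\ (5.31)} -\mathcal{E}\log(p_\theta(\theta)) \tag{5.33}
\]
has the following form:
\[
p_{\theta,ME}(\theta) = C e^{-\frac{1}{2}\theta^T\Sigma^{-1}\theta}, \tag{5.34}
\]
where the matrix $\Sigma^{-1}$ has the band structure: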
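The maximum entropy process admits the backward representation
\[
\theta_{k-1} = a_B\theta_k + w_k, \qquad w_k \sim \mathcal{N}(0, \sigma_k^2), \qquad k \in \{2, \ldots, n\}
\]
with
\[
a_B = \frac{\lambda_S(1 + \alpha) - \lambda_R}{2\lambda_S\alpha}, \tag{5.35}
\]
\[
\sigma_k^2 = \lambda_S\alpha^{k-1}(1 - a_B^2\alpha), \tag{5.36}
\]
and terminal condition
\[
\mathcal{E}\theta_n^2 = \lambda_S\alpha^n. \tag{5.37}
\]
Last, the autocovariance of $\theta_k$ satisfies the relation:
\[
\mathcal{E}\theta_k\theta_h = \lambda_S\,a_B^{|k-h|}\,\alpha^{\max(k,h)}. \tag{5.38}
\]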
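Corollary 5.1 Under the conditions of Theorem 5.5 and defining
\[
\rho := a_B\sqrt{\alpha} = \frac{\lambda_S(1 + \alpha) - \lambda_R}{2\lambda_S\sqrt{\alpha}}, \tag{5.39}
\]
the maximum entropy model in Theorem 5.5 corresponds to the so-called DC-kernel [10], i.e.,
\[
\mathcal{E}\theta_k\theta_h = \lambda_S\,\rho^{|k-h|}\,\alpha^{\frac{k+h}{2}}. \tag{5.40}
\]
In particular, for $\lambda_R = \lambda_S(1 - \alpha)$, this reduces to the so-called TC kernel [10] with
\[
\mathcal{E}\theta_k\theta_h = \lambda_S\alpha^{\max(k,h)}, \tag{5.41}
\]
while for $\lambda_R = \lambda_S(1 + \alpha)$, we obtain the covariance of the “diagonal” kernel
\[
\mathcal{E}\theta_k\theta_h = \begin{cases} \lambda_S\alpha^k & k = h \\ 0 & k \neq h. \end{cases} \tag{5.42}
\]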
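The kernel family of Theorem 5.5 and its special cases in Corollary 5.1 can be checked numerically. The sketch below builds the covariance from (5.35) and (5.38) and verifies the two moment constraints (5.31), the DC form (5.40), and the TC and DI special cases. The function name `max_entropy_kernel` and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def max_entropy_kernel(n, lam_s, lam_r, alpha):
    """Covariance E[theta_k theta_h] = lam_s * a_B^|k-h| * alpha^max(k,h), k, h = 1..n,
    with a_B = (lam_s (1 + alpha) - lam_r) / (2 lam_s alpha)."""
    a_b = (lam_s * (1 + alpha) - lam_r) / (2 * lam_s * alpha)
    k = np.arange(1, n + 1)
    lag = np.abs(np.subtract.outer(k, k))
    return lam_s * a_b ** lag * alpha ** np.maximum.outer(k, k)

n, lam_s, alpha = 20, 1.0, 0.8
k = np.arange(1, n + 1)

# Generic case: check the constraints (5.31) and the DC form (5.40)
lam_r = 1.0
K = max_entropy_kernel(n, lam_s, lam_r, alpha)
var_theta = np.diag(K)                                    # E[theta_k^2], k = 1..n
var_diff = var_theta[:-1] + var_theta[1:] - 2 * np.diag(K, 1)   # E[(theta_{k-1} - theta_k)^2], k = 2..n
rho = (lam_s * (1 + alpha) - lam_r) / (2 * lam_s * np.sqrt(alpha))
K_dc = lam_s * rho ** np.abs(np.subtract.outer(k, k)) * alpha ** (np.add.outer(k, k) / 2)

# Special cases: TC (lam_r = lam_s (1 - alpha)) and DI (lam_r = lam_s (1 + alpha))
K_tc = max_entropy_kernel(n, lam_s, lam_s * (1 - alpha), alpha)
K_di = max_entropy_kernel(n, lam_s, lam_s * (1 + alpha), alpha)
```

The single regularity parameter $\lambda_R$ thus interpolates between maximally correlated coefficients (TC, $a_B = 1$) and independent ones (DI, $a_B = 0$), with negative $a_B$ giving sign-alternating correlations beyond the DI value.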


Remark 5.1 In the maximum entropy kernel derived in Theorem 5.5, which includes DC, TC and DI as special cases as stressed in Corollary 5.1, the constant $\lambda_S$ plays only the role of a scale factor while $\alpha$ is a “decay rate”. Therefore, by fixing $\lambda_S = 1$ and $\alpha = 0.8$ we can study the behaviour as the “regularity” constant $\lambda_R$ varies in the interval $\lambda_S(1-\sqrt{\alpha})^2 = \lambda_{R,\min} \le \lambda_R \le \lambda_{R,\max} = \lambda_S(1+\sqrt{\alpha})^2$. This is entirely equivalent to studying the behaviour of the kernel as a function of the ratio $\lambda_R/\lambda_S$. We thus consider a grid of 9 possible values $\lambda_{R,\min} = \lambda_{R,1} < \lambda_{R,2} < \cdots < \lambda_{R,9} = \lambda_{R,\max}$. Then, Fig. 5.5 plots 5 sample realizations for each of these values, with panel (i) corresponding to the value $\lambda_{R,i}$. In particular, $\lambda_{R,4} = \lambda_S(1-\alpha)$ corresponds to the TC kernel, while $\lambda_R = \lambda_S(1+\alpha)$ induces the DI kernel. For each realization from the prior (solid line), its best single-exponential fit is also shown in order to highlight the “overall” decay rate, which can be thought of as an envelope of the curves. In panel (1), with $\lambda_R$ taking the smallest possible value, hence imposing the “maximum” amount of regularity, all realizations are pure exponentials. In panel (9), with $\lambda_R$ taking its maximum value, all realizations are pure damped oscillations. In fact, in both cases, it can be checked that the corresponding kernel is singular.

Fig. 5.5 Sample realizations (solid) and best (least squares) exponential fit as a function of the kernel parameters. In all figures $\alpha = 0.8$ and $\lambda_S = 1$. The regularity parameter $\lambda_R$ varies from its minimum value $\lambda_{R,\min} = \lambda_S(1-\sqrt{\alpha})^2 \simeq 0.011$ in panel (1) to the maximum value $\lambda_{R,\max} = \lambda_S(1+\sqrt{\alpha})^2 \simeq 3.589$ in panel (9). Panel (4), with $\lambda_R = 0.2$, corresponds to the TC kernel; panel (6) with $\lambda_R = 2.6$ to the DI kernel
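
The claim about panel (1) can be reproduced with a few lines of code. The sketch below draws one realization through the backward recursion of Theorem 5.5, starting from $\theta_n$, with $\lambda_R$ set to its minimum value. In that case $a_B = 1/\sqrt{\alpha}$, the innovation variances vanish and the realization is a pure exponential with ratio $\sqrt{\alpha}$ between consecutive coefficients. The function name, the seed and the clipping of tiny negative variances caused by round-off are assumptions of this illustration.

```python
import math
import random

def sample_backward(n, lam_s, lam_r, alpha, rng):
    """Draw theta_1..theta_n via theta_{k-1} = a_B * theta_k + w_k, starting from theta_n."""
    a_b = (lam_s * (1 + alpha) - lam_r) / (2 * lam_s * alpha)
    theta = [0.0] * (n + 1)                       # theta[0] is unused
    theta[n] = rng.gauss(0.0, math.sqrt(lam_s * alpha ** n))
    for k in range(n, 1, -1):
        # Innovation variance, clipped at zero to guard against round-off.
        var_k = max(lam_s * alpha ** (k - 1) * (1 - a_b ** 2 * alpha), 0.0)
        theta[k - 1] = a_b * theta[k] + rng.gauss(0.0, math.sqrt(var_k))
    return theta[1:]

rng = random.Random(0)
lam_s, alpha, n = 1.0, 0.8, 20
lam_r_min = lam_s * (1 - math.sqrt(alpha)) ** 2

theta = sample_backward(n, lam_s, lam_r_min, alpha, rng)
ratios = [theta[k + 1] / theta[k] for k in range(n - 1)]
print(min(ratios), max(ratios), math.sqrt(alpha))
```

Replacing `lam_r_min` with the maximum value $\lambda_S(1+\sqrt{\alpha})^2$ gives $a_B = -1/\sqrt{\alpha}$, so consecutive coefficients alternate in sign, which is the damped oscillation of panel (9).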


Degrees of Freedom of the DC Kernels 

Theorem 5.5 provides a class of kernels $K_\eta$ parametrized by the hyperparameter vector $\eta := [\lambda_S, \lambda_R, \alpha]$. In Fig. 5.5, we have illustrated how realizations from the prior change as a function of the regularity parameter $\lambda_R$ having fixed $\lambda_S = 1$ (or equivalently as a function of the ratio $\lambda_R/\lambda_S$). As discussed in Chap. 4, choosing the prior is equivalent to describing the model class. In the linear system identification context, this then defines a penalty function on impulse responses. A way to measure the “size” of the model class is to use the concept of equivalent degrees of freedom, introduced in the Bayesian context in Sect. 4.8. Unfortunately, the degrees of freedom are defined in terms of the output predictor sensitivity, so they require specifying not only the model class but also the experimental conditions under which the model is estimated. Only in limiting cases (such as an improper prior on finitely and linearly parametrized model classes) do the degrees of freedom become independent of the experiment and coincide with the number of parameters. In this section, we thus consider the prototypical setup in Eq. (5.18):

\[ Y = \Phi\theta_0 + E, \qquad Y \in \mathbb{R}^{N}, \quad N = 1000, \quad \theta_0 \in \mathbb{R}^{n}. \tag{5.43} \]

We recall that the matrix $\Phi$ is a Hankel matrix built with the input samples $\{u(t)\}$ so that $\Phi\theta_0$ implements the convolution of $u$ with $\theta_0$. The input $\{u(t)\}$ is now assumed to be a zero-mean unit variance white noise. We also assume the noise $\{e(t)\}$ is zero-mean unit variance white noise. We consider two scenarios in which the order of the system (length $n$ of $\theta_0$) is assumed to be either $n = 30$ or $n = 100$. Exploiting the derivation in Chap. 4 (see Definition 4.2 and Proposition 4.3), the degrees of freedom $\mathrm{dof}(\eta)$, as a function of the hyperparameter vector $\eta$, are given by

\[ \mathrm{dof}(\eta) = \mathrm{trace}\left(\Phi\left(\Phi^T\Phi + K_\eta^{-1}\right)^{-1}\Phi^T\right). \]

Assuming also here that $\lambda_S = 1$, we study how $\mathrm{dof}(\eta)$ varies as a function of $\lambda_R$ for three different values of $\alpha$ (0.6, 0.8, and 0.95). The behaviour is illustrated in Fig. 5.6, where it is apparent that the maximum is achieved for the DI kernel, and the minimum (a bit smaller than 1) is attained at the extremum points, where the kernel has rank exactly equal to 1. It is interesting to observe the interplay between the value of $\alpha$ (which controls the decay rate) and the length $n$ of the FIR model. As the coefficient vector $\theta_0$ changes from length $n = 30$ (left) to $n = 100$ (right), the effective “size” of the model does not change much for $\alpha = 0.6$ and $\alpha = 0.8$, while it does increase when $\alpha = 0.95$. This confirms the fact that the kernel, for $\alpha$ fixed, effectively controls the model complexity so that the estimator becomes insensitive to the chosen length, provided $n$ is “big enough” w.r.t. $\alpha$. In particular, $n = 15$ would be sufficient for $\alpha = 0.6$ and $n = 30$ for $\alpha = 0.8$, while for $\alpha = 0.95$ the effective size is about $n = 100$.
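
The qualitative behaviour described above can be reproduced with a short computation. The sketch below makes one simplifying assumption that is not in the text: for a unit variance white-noise input and large $N$ it replaces $\Phi^T\Phi$ with $N I$, so that the trace formula becomes $\mathrm{dof}(\eta) \approx n - \mathrm{trace}\left((I + N K_\eta)^{-1}\right)$, an expression that remains well defined when $K_\eta$ is singular. All function names are hypothetical.

```python
import math

def dc_kernel(n, lam_s, lam_r, alpha):
    """DC covariance: entry (k, h) is lam_s * rho**|k-h| * alpha**((k+h)/2), k, h = 1..n."""
    rho = (lam_s * (1 + alpha) - lam_r) / (2 * lam_s * math.sqrt(alpha))
    return [[lam_s * rho ** abs(k - h) * alpha ** ((k + h) / 2)
             for h in range(1, n + 1)] for k in range(1, n + 1)]

def trace_of_inverse(a):
    """Trace of the inverse by Gauss-Jordan elimination without pivoting.
    The input is assumed symmetric positive definite."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        pivot = m[c][c]
        m[c] = [x / pivot for x in m[c]]
        for r in range(n):
            if r != c and m[r][c] != 0.0:
                f = m[r][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return sum(m[i][n + i] for i in range(n))

def dof_white_input(n, n_data, lam_s, lam_r, alpha):
    """Approximate degrees of freedom with Phi^T Phi replaced by n_data * I."""
    k = dc_kernel(n, lam_s, lam_r, alpha)
    a = [[n_data * k[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return n - trace_of_inverse(a)

n, n_data, lam_s, alpha = 30, 1000, 1.0, 0.8
d_min = dof_white_input(n, n_data, lam_s, lam_s * (1 - math.sqrt(alpha)) ** 2, alpha)
d_tc = dof_white_input(n, n_data, lam_s, lam_s * (1 - alpha), alpha)
d_di = dof_white_input(n, n_data, lam_s, lam_s * (1 + alpha), alpha)
print(d_min, d_tc, d_di)
```

Under this approximation the degrees of freedom equal $\sum_i N\xi_i/(1+N\xi_i)$, where $\xi_i$ are the eigenvalues of $K_\eta$. Since all DC kernels with the same $\lambda_S$ and $\alpha$ share the same diagonal, and hence the same trace, concavity of $\xi \mapsto N\xi/(1+N\xi)$ explains why the diagonal kernel gives the largest value and the rank-one kernels a value just below 1.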


Extension to Smoothness Conditions on Filtered Versions ⋆

So far, we limited our attention to so-called “first-order” stable splines, which are derived imposing conditions on “first-order” differences, leading to first-order, i.e., AR(1), realizations. Of course these constructions can be generalized by replacing (5.31) with a higher-order constraint of the form

\[
\begin{aligned}
\mathcal{E}\,\|\theta_k\|^2 &\le \lambda_S\alpha^{k},\\
\mathcal{E}\,\Big\|\theta_k - \sum_{i=1}^{p} a_i\theta_{k+i}\Big\|^2 &\le \lambda_R\alpha^{k}.
\end{aligned}
\tag{5.44}
\]

While the first constraint is a “standard” stability condition, the second constraint can be interpreted as a filtered frequency domain smoothness condition. In fact, defining the filter $F(q) := 1 - \sum_{i=1}^{p} a_i q^{i}$, let us denote with $\theta_k^F$ the sequence obtained filtering $\theta_k$ with $F(q)$. The condition

\[ \mathcal{E}\,\|\theta_k^F\|^2 = \mathcal{E}\,\Big\|\theta_k - \sum_{i=1}^{p} a_i\theta_{k+i}\Big\|^2 \le \lambda_R\alpha^{k} \]

implies that $\theta_k^F$ should decay “fast” enough (in mean square) and thus

\[ \mathcal{E}\sum_{k=0}^{\infty} \|k^m\theta_k^F\|^2 \]

should be small for any integer $m$. As a consequence, if

\[ G(e^{j\omega}) = \sum_{k=1}^{\infty} \theta_k e^{-j\omega k}, \]

using Parseval’s theorem,

\[ \mathcal{E}\int_0^{2\pi} \left| F(e^{j\omega})\,\frac{d^m G(e^{j\omega})}{d\omega^m} \right|^2 d\omega \]

should be small as well, implying that $\theta$ should concentrate most of its energy (variance) in frequency bands where the (absolute value of the) filter $F(e^{j\omega})$ is small.

We regard developments of this type, in principle, as a straightforward extension of the basic ideas discussed in this chapter to obtain DC kernels. In particular, the choice of the coefficients $a_i$ in (5.44) is a design issue, which can be guided by prior knowledge on the candidate models, and its underlying principles and ideas are the same as those illustrated above. There are however additional complications due to the richer structure of the constraints, which might make it non-trivial to derive an analytic expression of the kernel.

Fig. 5.6 Effective degrees of freedom of the DC kernel as a function of $\lambda_R$ ($\lambda_S = 1$) for model (5.43); $n = 30$ (left), $n = 100$ (right). From top to bottom: $\alpha = 0.6$, $\alpha = 0.8$ and $\alpha = 0.95$


5.6 Regularization and Basis Expansion ⋆


The $\ell_2$ (ridge regression) regularized estimators that have been discussed in this chapter can also be framed in the context of basis expansion using the so-called Karhunen–Loève decomposition of the random process $\theta$. For the sake of exposition, we will now consider the finite-dimensional case, i.e., we will study FIR models of length $n$ of the form (5.14). Extension to the infinite-dimensional case will be discussed in the framework of Reproducing Kernel Hilbert Spaces illustrated in Chap. 6. Under this finite-dimensional assumption, we consider the covariance matrix $K \in \mathbb{R}^{n\times n}$ whose entries satisfy $[K]_{t,s} := K(t,s) = \mathrm{cov}(\theta_t, \theta_s)$. The matrix $K$ can be written in terms of its spectral decomposition (Singular Value Decomposition) in the form:

\[ K = USU^T = \sum_{i=1}^{n} \xi_i u_i u_i^T, \qquad u_i \in \mathbb{R}^n, \quad \|u_i\| = 1, \quad u_i \perp u_j \;\; \forall i \neq j, \tag{5.45} \]

where

\[ U := [u_1, \ldots, u_n], \qquad S := \mathrm{diag}\{\xi_1, \ldots, \xi_n\}. \]

The set of vectors $u_i \in \mathbb{R}^n$ provides an orthonormal basis of $\mathbb{R}^n$ so that any impulse response $\theta \in \mathbb{R}^n$ can be written using the orthonormal basis expansion

\[ \theta = \sum_{i=1}^{n} u_i\beta_i, \qquad \beta_i := \langle \theta, u_i \rangle, \tag{5.46} \]

where the coefficients $\beta_i = \langle \theta, u_i \rangle = u_i^T\theta$ are therefore zero-mean random variables with covariances

\[ \mathcal{E}\,\beta_i\beta_j = \mathcal{E}\,u_i^T\theta\theta^T u_j = u_i^T K u_j = \xi_i\delta_{ij}. \]

Clearly, the argument above can be reversed. Namely, starting from a (possibly orthonormal) basis $u_i$, $i = 1, \ldots, n$, the random basis expansion

\[ \theta = \sum_{i=1}^{n} u_i\beta_i, \qquad \beta_i \sim \mathcal{N}(0, \xi_i), \quad (\beta_1, \ldots, \beta_n) \text{ independent} \tag{5.47} \]

induces a probability description of the candidate $\theta$’s which turns out to be zero mean and with covariance matrix as in (5.45). This interpretation provides a clear link between “standard” models described in terms of basis expansions, regularization and the Bayesian view.
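
The link between a random basis expansion and the induced covariance can be made concrete on a toy case. The sketch below, for $n = 2$, fixes an orthonormal basis and two coefficient variances, forms the covariance $\sum_i \xi_i u_i u_i^T$ implied by independent coefficients, and then verifies that projecting back onto the basis returns $\xi_i\delta_{ij}$. The basis, the variances and the helper `quad` are arbitrary choices made for this illustration.

```python
import math

# Orthonormal basis of R^2 (a rotation by 30 degrees); u[i] is the i-th basis vector.
c, s = math.cos(math.pi / 6), math.sin(math.pi / 6)
u = [[c, s], [-s, c]]
xi = [2.0, 0.1]                 # variances of the independent coefficients beta_i

# Covariance induced by theta = sum_i u_i * beta_i: K = sum_i xi_i * u_i u_i^T.
K = [[sum(xi[i] * u[i][t] * u[i][r] for i in range(2)) for r in range(2)]
     for t in range(2)]

def quad(a, m, b):
    """Bilinear form a^T m b for 2-dimensional vectors."""
    return sum(a[t] * m[t][r] * b[r] for t in range(2) for r in range(2))

# G[i][j] = u_i^T K u_j, expected to equal xi_i when i == j and 0 otherwise.
G = [[quad(u[i], K, u[j]) for j in range(2)] for i in range(2)]
print(G)
```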


Remark 5.2 (Low-Rank Kernel Approximation) The spectral decomposition of the kernel (5.45) also suggests that, when some singular values $\xi_i$ are “very small”, the kernel can be easily approximated by a low-rank matrix

\[ K = \sum_{i=1}^{n} \xi_i u_i u_i^T \simeq \sum_{i=1}^{\tilde{n}} \xi_i u_i u_i^T, \qquad \tilde{n} < n. \]

This is equivalent to replacing the $\xi_i$ below a certain threshold with zero singular values. This threshold can be chosen by a standard SVD-truncation criterion, e.g., neglecting singular values below a certain fraction of the largest singular value $\xi_1$, i.e., that satisfy

\[ \xi_i < \frac{\xi_1}{R}. \]

In Fig. 5.7, the value $R = 20$ has been chosen to plot the most relevant eigenfunctions. Low-rank kernel approximation can also be exploited to reduce the computational burden in computing the solutions.

Fig. 5.7 First 7 eigenfunctions of the DC kernel. To enhance clarity, $\tilde{n}$ is chosen for each combination of the parameters so that $\tilde{n} = \arg\max_i i$ s.t. $\sigma_i^2 > \sigma_1^2/20$ (see Remark 5.2). In all figures, $\alpha = 0.8$ and $\lambda_S = 1$. The regularity parameter $\lambda_R$ varies from its minimum value $\lambda_{R,\min} = \lambda_S(1-\sqrt{\alpha})^2 \simeq 0.011$ in panel (1) to the maximum value $\lambda_{R,\max} = \lambda_S(1+\sqrt{\alpha})^2 \simeq 3.589$ in panel (9). Panel (4), with $\lambda_R = 0.2$, corresponds to the TC kernel; panel (6) with $\lambda_R = 2.6$ to the DI kernel
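
The truncation rule of Remark 5.2 is simple to state in code. The sketch below takes singular values sorted in non-increasing order and returns the number retained when those below a fraction $1/R$ of the largest are set to zero, together with the discarded part of the trace. The function names and the example spectrum (the diagonal kernel with $\lambda_S = 1$, $\alpha = 0.8$, whose singular values are simply $\alpha^k$) are assumptions of this illustration.

```python
def truncation_order(xi, ratio):
    """Number of singular values kept when those below xi[0] / ratio are discarded.
    xi is assumed sorted in non-increasing order with xi[0] > 0."""
    threshold = xi[0] / ratio
    return sum(1 for value in xi if value >= threshold)

def discarded_trace(xi, order):
    """Sum of the neglected singular values, i.e. the trace of the approximation error."""
    return sum(xi[order:])

# Example spectrum: diagonal kernel with unit scale and decay rate 0.8, n = 30.
xi = [0.8 ** k for k in range(1, 31)]
order = truncation_order(xi, ratio=20)
print(order, discarded_trace(xi, order) / sum(xi))
```

Since the approximation error $\sum_{i > \tilde{n}} \xi_i u_i u_i^T$ is positive semidefinite, its trace is a convenient scalar measure of how much prior variance the truncation removes.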


Figure 5.7 shows the eigenfunctions of the DC kernel for different choices of the hyperparameters. As already studied in the previous section, the “complexity” of the kernel, measured e.g., by the degrees of freedom as illustrated in Fig. 5.6, varies as the hyperparameters change. In the context of basis expansions, this is clear from Fig. 5.8, where the singular values of the kernel, i.e., the variances of the basis expansion coefficients $\beta_i$ introduced in (5.47), vary as the hyperparameters change. For instance when $\lambda_R = \lambda_{R,\min}$, see panel (1), and $\lambda_R = \lambda_{R,\max}$, see panel (9), the kernel has rank 1. Instead the singular values decay more slowly for the DI kernel, see panel (5), which also has the largest number of degrees of freedom, see Fig. 5.6.

Fig. 5.8 First 10 singular values of the DC kernel. In all figures, $\alpha = 0.8$ and $\lambda_S = 1$. The regularity parameter $\lambda_R$ varies from its minimum value $\lambda_{R,\min} = \lambda_S(1-\sqrt{\alpha})^2 \simeq 0.011$ in panel (1) to the maximum value $\lambda_{R,\max} = \lambda_S(1+\sqrt{\alpha})^2 \simeq 3.589$ in panel (9). Panel (4), with $\lambda_R = 0.2$, corresponds to the TC kernel; panel (6) with $\lambda_R = 2.6$ to the DI kernel

Even if this section is devoted to finite impulse response models (i.e., $n$ finite, and therefore BIBO stable systems), it still makes sense to discuss what happens to the coefficients $\theta_k$ when $n$ becomes “large” and its relation with BIBO stability. In Lemma 5.1, we have seen that a sufficient condition for a.s. BIBO stability of realizations from the Gaussian prior is that the diagonal elements of $K$ satisfy the summability condition

\[ \sum_{t=1}^{\infty} K(t,t)^{1/2} < \infty, \]

which requires a “sufficiently fast” decay rate of the diagonal $K(t,t)$. A quite natural question concerns how the behaviour of $K(t,t)$ reflects on the basis vectors $u_i$. The following lemma, whose proof is in Sect. 5.10.5, gives the answer.


Lemma 5.2 The basis vectors $u_i$ introduced in (5.45), whose $t$th elements are denoted by $u_{it}$, satisfy the inequality

\[ |u_{it}| \le \frac{\sqrt{C\,K(t,t)}}{\xi_i}, \qquad C := \sum_{h} K(h,h). \tag{5.48} \]

Condition (5.48) holds also in the infinite-dimensional case, i.e., as $n \to \infty$, provided $K(t,s)$ admits the spectral decomposition

\[ K(t,s) = \sum_{i=1}^{\infty} \xi_i u_{it} u_{is}, \]

where the $u_i$ are orthonormal sequences in $\ell_2$ and the condition $\sum_{t=1}^{\infty} K(t,t) = C < \infty$ is satisfied.


While this result is essentially trivial for $n$ finite, it becomes important when $n \to \infty$, since it provides a condition on the tail behaviour of the eigenvectors (eigenfunctions). For instance, if the diagonal entries (variances) of the kernel $K(t,t)$ decay exponentially fast as a function of $t$, so do the $u_{it}$. The decay of the eigenfunctions can be visually inspected in Fig. 5.7.


5.7 Hankel Nuclear Norm Regularization 


As discussed above, regularization can be used to enforce smoothness and stability of impulse responses. Yet this is just one way, and possibly not the most common in the field of dynamical systems, to control the “complexity” of model classes.

For instance, in the parametric approach to system identification, the complexity can be measured by the dimension of a minimal state-space realization of the unknown system. For ease of exposition, let us now only consider the single-input single-output output error case (i.e., $H(z) = 1$). In this case, the number of free parameters is $2n + 1$, where $n$ is the degree of the denominator of the transfer function $G(z,\theta)$. This also equals the dimension $n$ of a minimal state-space realization of $G(z,\theta)$, which is called the McMillan degree of $G(z,\theta)$, as seen in Sect. 2.2.1.1. To fix notation, let us introduce a minimal state-space realization of $G(z,\theta)$

\[
\begin{aligned}
x_{k+1} &= Ax_k + Bu_k, \qquad x_k \in \mathbb{R}^n,\\
y_k &= Cx_k + e_k,
\end{aligned}
\tag{5.49}
\]

which is such that $G(z,\theta) = C(zI - A)^{-1}B$. If $\{g(k,\theta)\}_{k\in\mathbb{N}}$ is the impulse response sequence, parametrized by $\theta$, then one has $g(k,\theta) = CA^{k-1}B$ $\forall k > 0$.
It is well known from realization theory that the McMillan degree has a close connection with the so-called Hankel matrix formed with the impulse response coefficients, i.e.,

\[
\mathcal{H}_{r,c}(\theta) :=
\begin{bmatrix}
g(1,\theta) & g(2,\theta) & g(3,\theta) & \cdots & g(c,\theta)\\
g(2,\theta) & g(3,\theta) & g(4,\theta) & \cdots & g(c+1,\theta)\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
g(r,\theta) & g(r+1,\theta) & g(r+2,\theta) & \cdots & g(r+c-1,\theta)
\end{bmatrix}
\tag{5.50}
\]

with $r$ block rows and $c$ block columns. The following lemma holds.

Lemma 5.3 (based on [65]) The linear time-invariant system with impulse response $\{g(k,\theta)\}_{k\in\mathbb{N}}$ admits a minimal state-space realization of order $n$ (i.e., has McMillan degree equal to $n$) if and only if, for some choice of $r$, $c$ the following holds:

\[ n = \mathrm{rank}\{\mathcal{H}_{r,c}(\theta)\} = \mathrm{rank}\{\mathcal{H}_{r+i,c+i}(\theta)\} \qquad \forall i \in \mathbb{N}. \tag{5.51} \]

In practice, only a finite number of impulse response (Markov) parameters $g(k,\theta)$, $k = 1, \ldots, p$, is available and the problem of finding a state-space model of the form (5.49) such that $g(k,\theta) = CA^{k-1}B$ $\forall k = 1, \ldots, p$, is known as the partial realization problem.
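
The rank property of Lemma 5.3 can be observed on a small example. The sketch below generates the Markov parameters $g(k) = CA^{k-1}B$ of a two-state system, stacks them in Hankel matrices of two different sizes and computes a numerical rank by Gaussian elimination. The system matrices, the helper names and the tolerance `tol` are hypothetical choices for this illustration; the chosen system is minimal, so the expected rank is 2.

```python
def impulse_response(a, b, c, length):
    """Markov parameters g(k) = C A^(k-1) B, k = 1..length, for a SISO system (nested lists)."""
    g, x = [], b[:]                      # x holds A^(k-1) B
    for _ in range(length):
        g.append(sum(ci * xi for ci, xi in zip(c, x)))
        x = [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]
    return g

def hankel(g, r, c):
    """r x c Hankel matrix whose (i, j) entry is g(i + j - 1) with 1-based indices."""
    return [[g[i + j] for j in range(c)] for i in range(r)]

def numerical_rank(m, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting; tol is an ad hoc threshold."""
    m = [row[:] for row in m]
    rank, rows, cols = 0, len(m), len(m[0])
    for col in range(cols):
        piv = max(range(rank, rows), key=lambda r: abs(m[r][col]), default=None)
        if piv is None or abs(m[piv][col]) < tol:
            continue
        m[rank], m[piv] = m[piv], m[rank]
        for r in range(rank + 1, rows):
            f = m[r][col] / m[rank][col]
            m[r] = [x - f * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank

# A minimal (controllable and observable) two-state example with poles 0.9 and 0.5.
A = [[0.9, 0.2], [0.0, 0.5]]
B = [1.0, 1.0]
C = [1.0, 0.0]

g = impulse_response(A, B, C, 9)
rank_4 = numerical_rank(hankel(g, 4, 4))
rank_5 = numerical_rank(hankel(g, 5, 5))
print(rank_4, rank_5)
```

With noisy impulse response estimates the exact rank is generically full, which is one motivation for replacing the rank with a continuous surrogate.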


This shows that, indeed, a notion of “complexity” can be attached to the dimension $n$ of a minimal state-space realization (5.49); therefore the rank of the Hankel matrix $\mathcal{H}_{r,c}(\theta)$ can be considered as a candidate for performing regularization. This leads to the choice of a penalty given by

\[ J_{\mathcal{H},\gamma}(\theta) := \gamma\,\mathrm{rank}\{\mathcal{H}_{r,c}(\theta)\} \tag{5.52} \]

for suitable values of the integers $c$, $r$. Unfortunately, similarly to what happens for the $\ell_0$ quasi-norm $\|x\|_0$ (defined as the number of non-zero entries in the vector $x$) discussed in Sect. 3.6.2.1, the rank functional is not convex; as a result, solving optimization problems involving penalties of the form (5.52) is problematic. The very same issue arises in a variety of rank-constrained optimization problems.

As seen in Chap. 3, to overcome this limitation, inspired by work on $\ell_1$ regularization, researchers have suggested using the nuclear norm $\|A\|_*$ of a matrix $A \in \mathbb{R}^{m\times n}$, defined as

\[ \|A\|_* := \mathrm{trace}\left(\sqrt{A^TA}\right) = \sum_i \sigma_i(A), \tag{5.53} \]

where $\sigma_i(A)$ denotes the $i$th singular value of the matrix $A$, as a surrogate for the rank of the matrix $A$. The nuclear norm is also known as Ky Fan $n$-norm or trace norm. This choice is motivated by the following lemma.


Lemma 5.4 (based on [20]) Given a matrix $A \in \mathbb{R}^{m\times n}$, the nuclear norm of $A$ is the convex envelope of the rank function on the set $\mathcal{A} := \{A \in \mathbb{R}^{m\times n},\ \|A\| \le 1\}$.


These considerations have led to a whole class of regularization methods which build upon the nuclear norm of the Hankel matrix

\[ J_{\mathcal{H}_*}(\theta) := \gamma\,\|\mathcal{H}_{r,c}(\theta)\|_* \]

as a possible regularizer. Several extensions have also been considered, including weighted versions of the form

\[ J_{\mathcal{H}_*,W}(\theta) = \gamma\,\|W_r^T\mathcal{H}_{r,c}(\theta)W_c\|_*, \]

where $W_c$ and $W_r$ are, respectively, “column” and “row” weightings. The latter can possibly be adapted iteratively, in the framework of iteratively reweighted methods such as those commonly used in conjunction with $\ell_1$ and/or $\ell_2$ reweighted schemes, see e.g., [72].


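The Hankel norm regularizer can also be studied from a Bayesian perspective, considering the prior

\[ p_{\mathcal{H}_*,\gamma}(\theta) \propto \exp\left(-\gamma\,\|\mathcal{H}_{r,c}(\theta)\|_*\right) = \exp\Big(-\gamma\sum_i \sigma_i(\mathcal{H}_{r,c}(\theta))\Big). \tag{5.54} \]

To gain some intuition on the structure of this prior, let $g(k,\theta) = \theta_k$ and consider the following modified prior which penalizes the nuclear norm of the squared Hankel matrix, i.e.,

\[ \bar{p}_{\mathcal{H}_*,\gamma}(\theta) \propto \exp\left(-\gamma\,\|\mathcal{H}_{r,c}(\theta)\mathcal{H}_{r,c}(\theta)^T\|_*\right) = \exp\Big(-\gamma\sum_i \sigma_i^2(\mathcal{H}_{r,c}(\theta))\Big). \tag{5.55} \]

The reason for introducing $\bar{p}$ is twofold. The first is related to the fact that the prior (5.55) is equivalent to assuming that the entries $\theta_k$ of the impulse response are independent zero-mean Gaussians, as formalized in the following proposition.


Proposition 5.1 (based on [53]) Let $\bar{p}_{\mathcal{H}_*,\gamma}(\theta)$ be as in (5.55) and let $\theta \in \mathbb{R}^m \sim \bar{p}_{\mathcal{H}_*,\gamma}(\theta)$, where $\mathcal{H}_{p,p}(\theta)$ is its $p \times p$ Hankel matrix (with $m = 2p - 1$). Then the $\theta_k$’s are zero mean, independent and Gaussian. In particular:

\[
\theta_k \sim
\begin{cases}
\mathcal{N}\left(0, \dfrac{1}{2\gamma k}\right) & \text{if } 1 \le k \le \dfrac{m+1}{2},\\[2ex]
\mathcal{N}\left(0, \dfrac{1}{2\gamma(m-k+1)}\right) & \text{if } \dfrac{m+1}{2} < k \le m.
\end{cases}
\tag{5.56}
\]

As illustrated in Fig. 5.9, from (5.56) one sees that the variance of $\theta_k$ is not decaying with the lag $k$, and hence the prior $\bar{p}_{\mathcal{H}_*,\gamma}(\theta)$ does not induce a BIBO stable hypothesis space.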
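
The mechanism behind Proposition 5.1 is elementary: $\mathcal{H}\mathcal{H}^T$ is positive semidefinite, so its nuclear norm equals its trace, i.e., the squared Frobenius norm of $\mathcal{H}$, and each $\theta_k$ contributes $\theta_k^2$ once for every position it occupies in the Hankel matrix. The sketch below counts these multiplicities for a small case and derives the resulting Gaussian variances. The function names and the test vector are assumptions of this illustration.

```python
import math

def hankel_multiplicity(k, p):
    """Number of entries of the p x p Hankel matrix (entry (i, j) = theta_{i+j-1})
    that are equal to theta_k, for k = 1..2p-1."""
    m = 2 * p - 1
    return k if k <= p else m - k + 1

def prior_variance(k, p, gamma):
    """Variance of theta_k when the density is proportional to
    exp(-gamma * sum_k multiplicity_k * theta_k**2)."""
    return 1.0 / (2.0 * gamma * hankel_multiplicity(k, p))

p, gamma = 4, 1.0
m = 2 * p - 1

# Check trace(H H^T) = sum_k multiplicity_k * theta_k^2 on an arbitrary vector.
theta = [0.3 * (k + 1) for k in range(m)]          # theta[k-1] stores theta_k
H = [[theta[i + j] for j in range(p)] for i in range(p)]
frobenius_sq = sum(x * x for row in H for x in row)
weighted_sq = sum(hankel_multiplicity(k, p) * theta[k - 1] ** 2
                  for k in range(1, m + 1))

variances = [prior_variance(k, p, gamma) for k in range(1, m + 1)]
print(math.isclose(frobenius_sq, weighted_sq), variances)
```

The variances are symmetric around the middle lag $k = p$, where they are smallest, so they increase again towards the last coefficient instead of decaying.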

Second, the prior $\bar{p}_{\mathcal{H}_*,\gamma}(\theta)$ can be used as a proposal distribution for an MCMC scheme, as introduced in Sect. 4.10, to sample from the Hankel prior $p_{\mathcal{H}_*,\gamma}(\theta)$ in (5.54) with $g(k,\theta) = \theta_k$. Samples from $p_{\mathcal{H}_*,\gamma}(\theta)$ can then be used to approximate the variances $\mathrm{Var}\{\theta_k\}$ and the correlations $\mathrm{Corr}\{\theta_k, \theta_h\}$. These are shown in Fig. 5.9. In particular, the solid line in the left panel shows $\mathrm{Var}\{\theta_k\}$ as a function of $k$, while the right panel shows $\mathrm{Corr}\{\theta_k, \theta_{k+h}\}$ as a function of $h$ for $k$ fixed to 50. It is clear that, even though under $p_{\mathcal{H}_*,\gamma}(\theta)$ the $\theta_k$’s are not Gaussian, the variances resemble those of $\bar{p}_{\mathcal{H}_*,\gamma}(\theta)$ (left panel, dashed line) and their correlations resemble those of independent variables. For the sake of comparison, the left panel also plots the profiles of the impulse response coefficients’ variances using the TC prior for two different decay rates (dashdot lines).

Fig. 5.9 Prior induced by the Hankel Nuclear Norm: the impulse response coefficients are contained in the vector $\theta \in \mathbb{R}^{79}$, modelled as a random vector with probability density function $p_{\mathcal{H}_*,\gamma}(\theta) \propto \exp(-\|\mathcal{H}_{40,40}(\theta)\|_*)$. Left: variances of the impulse response coefficients $\theta_k$ reconstructed by MCMC (solid line) and approximated using the prior (5.56) (dashed line). The figure also displays the variances of $\theta_k$ when $\theta$ is a Gaussian random vector with stable spline (TC) covariance (5.41) for two different values of $\alpha$ (dashdot lines). All the profiles are rescaled so that they share the same initial value. Right: 40th row of the matrix containing the correlation coefficients returned by the MATLAB command corrcoef(M), where each column of the matrix M, which has 79 rows, contains one MCMC realization of $\theta$ under the Hankel prior $p_{\mathcal{H}_*,\gamma}(\theta)$. The adopted MCMC scheme was a random walk Metropolis with increments proportional to the variances (5.56) divided by a factor equal to 4


These observations suggest that, while the nuclear norm regularization (prior) accounts for system-theoretic notions of model complexity as defined by the McMillan degree, it fails to include decay rate and smoothness constraints. One would expect, therefore, that Hankel regularization alone may not give satisfactory results as it is not able to properly bound the candidate set of models. It turns out that the maximum entropy framework discussed in Sect. 5.5.1 can be used to build prior distributions which account for stability and smoothness as well as “complexity”. The following theorem (whose proof is given in Sect. 5.10.6) gives the structure of the MaxEnt prior under a simple “TC”-like condition on the stability-smoothness constraint.


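Theorem 5.6 Let $\{\theta_k\}_{k=1,\ldots,m}$ be a zero-mean, absolutely continuous random vector with density $p_\theta(\theta)$, which satisfies the following constraints:

\[
\begin{aligned}
\mathcal{E}\left[\theta_m^2\right] &\le \lambda\alpha^{m},\\
\mathcal{E}\left[(\theta_{k-1}-\theta_k)^2\right] &\le \lambda\alpha^{k-1}(1-\alpha), \qquad k = 2,\ldots,m,\\
\mathcal{E}\,\|\mathcal{H}(\theta)\|_* &\le h.
\end{aligned}
\tag{5.57}
\]

Then, the solution $p_{\theta,ME\mathcal{H}}(\theta)$ of the maximum entropy problem

\[ p_{\theta,ME\mathcal{H}} := \arg\max_{p(\cdot)\ \text{s.t. (5.57)}} \; -\mathcal{E}\log(p_\theta(\theta)) \tag{5.58} \]

has the following form:

\[ p_{\theta,ME\mathcal{H}}(\theta) \propto e^{-\mu_{\mathcal{H}}\|\mathcal{H}(\theta)\|_*}\; e^{-\sum_{k=2}^{m}\mu_{k-1}(\theta_{k-1}-\theta_k)^2}\; e^{-\mu_m\theta_m^2}, \tag{5.59} \]

where the Lagrange multipliers $\mu_{\mathcal{H}}, \mu_1, \ldots, \mu_m$ are determined so that the constraints (5.57) are satisfied.¹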


The Hankel nuclear norm discussed in this chapter is only one possible way to favour “simple” models (in the sense of having small McMillan degree). Indeed, it is by no means trivial to use priors of the form (5.59), which involve nuclear norm terms, in conjunction with marginal likelihood optimization to estimate hyperparameters. Several variations are possible and, indeed, matricial reweighting schemes such as those used in [55] can be used in a Bayesian context, leading to iteratively reweighted schemes that are reminiscent of $\ell_1/\ell_2$ reweighting [72].


5.8 Historical Overview 


The framework discussed in this chapter has a long history that can be traced back, by and large outside the control community, to the early 1970s. In this section, we will review these developments and point to similarities and differences with the theory developed in this chapter.


5.8.1 The Distributed Lag Estimator: Prior Means and Smoothing


To the best of our knowledge, Bayesian methods for estimating dynamical systems were first advocated in the early 1970s in the econometrics literature for FIR models of the form (5.14), which were referred to as distributed lag models. The length $n$ of the FIR model was actually left unspecified, and possibly allowed to go to infinity.

In particular, [40, 62] were the first to talk about (and apply) Bayesian methods for system identification, arguing that “rigid parametric” structures may be inadequate, extending arguments which can be found in [66] for “static” linear regression models to the “dynamical” systems scenario. In the paper [40], having in mind that modes of linear time-invariant systems have an exponentially decaying behaviour of the type $\alpha^t$, it was suggested to describe the unknown impulse response $\theta$ with a process having an exponentially decaying prior mean

\[ \{m_t\}_{t\in\mathbb{N}}, \qquad m_t = c\,\alpha^{t}, \qquad |\alpha| < 1. \tag{5.60} \]

¹ Using the complementary slackness conditions it follows that a multiplier may be nonzero only if the corresponding inequality in (5.57) holds with an equality sign.

Other possible response patterns had been considered, such as the hump, composed of the response build-up, the maximum and its decay; see [40] for details and alternative patterns. The covariance function $K(t,s)$ in [40] was taken so that the ratio

\[ \frac{\mathrm{std}(\theta_t)}{m_t} \]

remains constant over time $t$. This was called the “proportionality principle” and can be achieved with the choice

\[ K(t,s) = \mathrm{cov}(\theta_t, \theta_s) := v\,w_{ts}\,\alpha^{t+s}, \qquad |w_{ts}| \le 1, \tag{5.61} \]

so that the normalized standard deviation

\[ \frac{\mathrm{std}(\theta_t)}{m_t} = \frac{\sqrt{K(t,t)}}{m_t} = \frac{\sqrt{v\,w_{tt}\,\alpha^{2t}}}{c\,\alpha^{t}} = \frac{\sqrt{v\,w_{tt}}}{c} \]

is indeed constant if $w_{tt}$ is so. This would imply that prior credible intervals have constant relative size w.r.t. their means, see p. 1065 of [40].

The choice (5.61) left the coefficients $w_{ts}$ unspecified and, indeed, in [40] it was emphasized that “the selection of the values of the set of $w_{ij}$ still remains a relatively difficult task”; one suggestion, inspired by work on smoothing [34], has been to take

\[ w_{ij} = w^{|i-j|}, \qquad 0 < w < 1, \tag{5.62} \]

leading to

\[ K_{ij} = v\,\alpha^{i+j}\,w^{|i-j|}, \tag{5.63} \]

which is exactly the DC kernel introduced in Corollary 5.1. It is also interesting to observe that [40] already suggested the use of marginal likelihood to choose the most suitable prior distribution in the class.
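
The identification with the DC kernel amounts to a reparametrization, which the following sketch checks entry by entry. It assumes the covariance has the form $v\,\alpha^{i+j}w^{|i-j|}$ and compares it with a DC kernel of scale $v$, decay rate $\alpha^2$ and correlation parameter $w$. Function names and numerical values are hypothetical.

```python
import math

def distributed_lag_kernel(i, j, v, alpha, w):
    """Covariance v * alpha**(i + j) * w**|i - j| with proportional standard deviations."""
    return v * alpha ** (i + j) * w ** abs(i - j)

def dc_kernel_entry(k, h, lam, decay, rho):
    """DC kernel entry lam * rho**|k - h| * decay**((k + h) / 2)."""
    return lam * rho ** abs(k - h) * decay ** ((k + h) / 2)

v, alpha, w = 2.0, 0.9, 0.7   # illustrative values
n = 8
same = all(
    math.isclose(distributed_lag_kernel(i, j, v, alpha, w),
                 dc_kernel_entry(i, j, v, alpha ** 2, w))
    for i in range(1, n + 1) for j in range(1, n + 1))
print(same)
```

Note that the decay rate of the DC kernel is the square of the decay rate of the prior mean, because the covariance scales with the product of two means.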

Of course, postulating a prior mean $m$ introduces a considerable bias into the estimation procedure and requires quite accurate knowledge of the expected $\theta$. The paper [62], inspired by “smoothing priors” arguments, suggested instead that the prior mean should be zero, and only smoothness conditions on the lags should be enforced; this leads to a zero-mean prior, i.e., $c = 0$ in (5.60), with a $d$th degree smoothing covariance. For instance, for $d = 2$, the prior model can be expressed in terms of the second-order differences:

\[
\tilde{\theta} := S\theta, \qquad
S :=
\begin{bmatrix}
1 & -2 & 1 & 0 & \cdots & 0\\
0 & 1 & -2 & 1 & \cdots & 0\\
\vdots & & \ddots & \ddots & \ddots & \vdots\\
0 & \cdots & 0 & 1 & -2 & 1
\end{bmatrix},
\]

postulating $\mathcal{E}\,\tilde{\theta}\tilde{\theta}^T = S\,\mathcal{E}\,\theta\theta^T S^T = I$.
It is clear from Fig. 5.10 that this prior guarantees smoothness in the time domain (and therefore low-pass behaviour in the frequency domain) but gives no guarantee of stability.

Fig. 5.10 50 realizations from Shiller’s prior (with penalty on initial condition and first difference). It is clear from the picture that the realizations are smooth, as expected, but certainly do not resemble impulse responses of (stable) linear systems
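
A small computation shows why a second-difference penalty alone cannot enforce stability. The sketch below builds the second-difference matrix with rows $[1, -2, 1]$ and applies it to two sequences: a linear ramp, which is not a stable impulse response but has identically zero second differences, and a decaying exponential, which is penalized. The helper names and the test sequences are assumptions of this illustration.

```python
def second_difference_matrix(n):
    """(n-2) x n matrix whose i-th row holds [1, -2, 1] starting at column i."""
    stencil = {0: 1.0, 1: -2.0, 2: 1.0}
    return [[stencil.get(j - i, 0.0) for j in range(n)] for i in range(n - 2)]

def apply_matrix(s, theta):
    """Matrix-vector product for nested-list matrices."""
    return [sum(a * b for a, b in zip(row, theta)) for row in s]

def penalty(s, theta):
    """Squared norm of the second differences of theta."""
    return sum(x * x for x in apply_matrix(s, theta))

n = 8
S = second_difference_matrix(n)
ramp = [float(k) for k in range(1, n + 1)]     # grows linearly with the lag
decay = [0.5 ** k for k in range(1, n + 1)]    # exponentially decaying response

print(penalty(S, ramp), penalty(S, decay))
```

Any affine sequence lies in the null space of $S$, so the prior places no restriction on it unless additional conditions (such as those on the initial condition and first difference mentioned in the caption of Fig. 5.10) are imposed.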


5.8.2 Frequency-Domain Smoothing and Stability 


The “time-domain” smoothing discussed in the previous section has been criticized by Akaike [1], who posed the question whether time-domain smoothness conditions would “be the most natural ones”. Akaike suggested that smoothness should instead be enforced in the frequency domain, i.e., considering the frequency response

\[ G(e^{j\omega}) := \sum_{k=1}^{\infty} \theta_k e^{-j\omega k}. \]

To this purpose, the $L_2$-norm of the first derivative can be considered and we have already seen in (5.25) that one obtains

\[ \left\| \frac{dG(e^{j\omega})}{d\omega} \right\|^2 = \sum_{k=1}^{\infty} k^2\theta_k^2. \tag{5.64} \]

Discouraging large $\left\| \frac{dG(e^{j\omega})}{d\omega} \right\|^2$ can thus be obtained using the right-hand side of (5.64) as a penalty, which can be written in the form:

\[ p(\gamma, \theta) := \theta^T K_\gamma^{-1}\theta, \]

where

\[ K_\gamma := \frac{1}{\gamma}\,\mathrm{diag}\left\{1, \frac{1}{4}, \frac{1}{9}, \ldots, \frac{1}{n^2}\right\}. \tag{5.65} \]

This is of course equivalent to assuming that the impulse response vector $\theta$ has a zero-mean normal prior with covariance $K_\gamma$.

Unfortunately, in the limit $n \to \infty$, the covariance function (5.65) does not meet the (more stringent) sufficient conditions of Lemma 5.1; of course rather straightforward extensions include setting penalties on higher-order derivatives, which would result in a faster decay rate of the diagonal elements of (5.65). This is a manifestation of the well-known link between regularity in the frequency domain and decay rate of the impulse response already discussed in Sect. 5.5.
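
The failure of the summability condition can be seen numerically. The sketch below assumes, for the first-derivative penalty, prior variances of the form $K(k,k) = 1/(\gamma k^2)$ and compares partial sums of the variances, which converge, with partial sums of the standard deviations, which grow like the harmonic series. The function name and the truncation lengths are arbitrary choices for this illustration.

```python
import math

def partial_sums(n_terms, gamma=1.0):
    """Partial sums of K(k,k) and of K(k,k)**0.5 for K(k,k) = 1 / (gamma * k**2)."""
    sum_var, sum_std = 0.0, 0.0
    for k in range(1, n_terms + 1):
        variance = 1.0 / (gamma * k * k)
        sum_var += variance
        sum_std += math.sqrt(variance)
    return sum_var, sum_std

results = {n: partial_sums(n) for n in (10 ** 3, 10 ** 5)}
for n, (sum_var, sum_std) in results.items():
    print(n, sum_var, sum_std)
```

Increasing the number of terms by a factor of 100 adds roughly $\ln 100$ to the sum of standard deviations, whereas the sum of variances stays below $\pi^2/(6\gamma)$. A penalty on the second derivative would give variances of order $1/k^4$ and standard deviations of order $1/k^2$, which are summable.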


5.8.3 Exponential Stability and Stochastic Embedding 


More recently, Gaussian priors for dynamical systems have been considered in the control literature; in particular, a zero-mean Gaussian prior with diagonal and exponentially decaying covariance

\[ \mathcal{E}\,\theta\theta^T = K_{\rho,a} := a\,\mathrm{diag}\{1, \rho, \rho^2, \ldots, \rho^{n-1}\} \tag{5.66} \]

has been proposed in the so-called “stochastic embedding” framework [25, 26]. Let us now briefly introduce the problem: consider an Output Error model of the form

\[ y(t) = \sum_{k=1}^{\infty} g_k u(t-k) + e(t), \]

where $g_k(\theta)$, $\theta \in \mathbb{R}^n$, is a parametric description of the unknown impulse response $\{g_k\}_{k=1,\ldots,\infty}$ in the model class $\mathcal{M}_n(\theta)$. Let $\hat{\theta}$ be some parametric estimator of $\theta$, e.g., the PEM estimator

\[ \hat{\theta} = \arg\min_{\theta} \sum_{t=1}^{N} \|y(t) - G(z,\theta)u(t)\|^2. \tag{5.67} \]

Let now

\[ \hat{G}(z) = G(z,\hat{\theta}) = \sum_{k=1}^{\infty} g_k(\hat{\theta})z^{-k} \]

be the corresponding estimator of the transfer function $G(z,\theta) = \sum_{k=1}^{\infty} g_k(\theta)z^{-k}$.
In the Model Error Modelling framework, it is assumed that the “true” transfer function $G_0(z)$ is only partially captured by the chosen model class $\mathcal{M}_n(\theta)$ so that

\[ G_0(z) = G(z,\theta) + G_\Delta(z), \qquad G(z,\theta) \in \mathcal{M}_n(\theta), \tag{5.68} \]

and $G_\Delta(z)$ represents a model error. The purpose of Model Error Modelling is to obtain a statistical description of the model error, say

\[ \tilde{G}(z) := G_0(z) - \hat{G}(z), \]

which may be used, for instance, to estimate the model order, e.g., the dimension $n$ of the parameter vector $\theta$. This can be achieved by minimizing an estimate of the MSE

\[ \mathcal{E}\,\|G_0(z) - G(z,\hat{\theta})\|^2 \]

while accounting for the model error model $\tilde{G}(z)$, see e.g., Eqs. (89)–(92) in [26].
The model error $\tilde{G}(z)$ is estimated in [26] starting from the least squares residuals $v_\Delta(t) := y(t) - G(z,\hat{\theta})u(t)$ which, under assumption (5.68), are expected to be described by the model

\[ v_\Delta(t) = \tilde{G}(z)u(t) + e(t). \]

It is remarkable that [26] proposes to estimate the parameters $a$ and $\rho$ that characterize the covariance (5.66) by resorting to marginal likelihood maximization

\[ (\hat{a}, \hat{\rho}) := \arg\max_{a,\rho} \int p(V_\Delta \mid \tilde{g})\,p(\tilde{g} \mid a, \rho)\,d\tilde{g}, \tag{5.69} \]

where $V_\Delta := [v_\Delta(1), \ldots, v_\Delta(N)]$ and $\tilde{g}$ denotes the impulse response of $\tilde{G}(z)$. It is also interesting to observe that the exponential decay of the covariance sequence (5.66) implies a smoothness condition on the frequency response function similar in spirit to that advocated in [1]. This is formalized in the following result, whose proof is in Sect. 5.10.7.


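Lemma 5.5 Let $\{g_{k,\Delta}\}_{k=0,\ldots,\infty}$ be a zero-mean Gaussian process with covariance (5.66) and let

\[ G_\Delta(e^{j\omega}) = \sum_{k=0}^{\infty} g_{k,\Delta}\,e^{-j\omega k}, \qquad \omega \in [0, 2\pi) \]

be its Fourier transform. Then the Lipschitz-like condition

\[ \mathcal{E}\left[\left|G_\Delta(e^{j\omega_1}) - G_\Delta(e^{j\omega_2})\right|^2\right] \le \frac{a\,\rho(1+\rho)}{(1-\rho)^3}\,(\omega_1-\omega_2)^2, \qquad |\rho| < 1 \tag{5.70} \]

holds.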
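
A bound of this type is easy to test numerically. The sketch below assumes independent coefficients with variances $a\rho^k$, $k \ge 0$, so that the expected squared difference of the frequency response at two frequencies is $\sum_k a\rho^k |e^{-j\omega_1 k} - e^{-j\omega_2 k}|^2$. The constant used for comparison follows from the elementary inequality $|e^{-j\omega_1 k} - e^{-j\omega_2 k}| \le k|\omega_1 - \omega_2|$ together with $\sum_k k^2\rho^k = \rho(1+\rho)/(1-\rho)^3$; it is derived here only for illustration and should be read as an assumption of this sketch, as should the function names and the truncation length.

```python
import cmath

def expected_sq_diff(a, rho, w1, w2, n_terms=2000):
    """Truncated sum of a * rho**k * |exp(-j w1 k) - exp(-j w2 k)|**2 over k >= 0."""
    return sum(a * rho ** k * abs(cmath.exp(-1j * w1 * k) - cmath.exp(-1j * w2 * k)) ** 2
               for k in range(n_terms))

def lipschitz_bound(a, rho, w1, w2):
    """Bound a * rho * (1 + rho) / (1 - rho)**3 * (w1 - w2)**2, valid for 0 <= rho < 1."""
    return a * rho * (1 + rho) / (1 - rho) ** 3 * (w1 - w2) ** 2

a, rho = 1.0, 0.8
pairs = [(0.1, 0.3), (1.0, 2.5), (0.0, 3.0)]
checks = [(expected_sq_diff(a, rho, w1, w2), lipschitz_bound(a, rho, w1, w2))
          for w1, w2 in pairs]
for lhs, rhs in checks:
    print(lhs, rhs)
```

The bound becomes looser as $\rho$ approaches 1, which reflects the fact that a slowly decaying covariance allows rapidly varying frequency responses.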


5.9 Further Topics and Advanced Reading 


Section 1.3 already reported a list of topics and readings on inverse problems, Stein 
estimators and their link with the Empirical Bayes framework. 

The use of regularization and Bayesian priors can be probably dated back to 
the paper [71] were smoothing ideas have been advocated for a denoising problem 
in the field of Actuarial Science. See also the much later reference [34]. The later 
developments are essentially impossible to survey in this short section and we refer 
the reader to [66] for an early overview on the use of Bayes priors in the context of 
linear regression; the interested reader may also consult [22, 31, 32, 42, 59] where 
generalized ridge regression has been proposed to stabilize ill-conditioned inverse 
problems. 

To the best of our knowledge, [40, 62] have been the first to use these ideas in 
the context of dynamical systems, named “distributed-lag” models in these early 
references. This work has been subsequently taken up by Akaike [1] and later on 
by Kitagawa and Gersh in a series of papers, see e.g., [35, 36], which culminated 
in the well-known book [37]. The seminal papers by Leamer and Shiller have also 
been continued by the econometrics community, starting with the work by Doan, 
Litterman and Sims, see e.g., [18] for an overview and further references. This has 
lead to the so-called “Minnesota prior”, which has been discussed quite extensively 
in the econometrics literature; several variations and extensions are found, see for 
instance [23, 41]. 

The econometrics literature has since then studied Bayesian procedures for system 
identification rather intensively, mostly under the acronym Bayesian VARs; the main 
driving motivation was that of handling high-dimensional time series (i.e., p large, 
called cross-sectional dimension in the econometrics literature) with possibly many 
explicative variables (m large), see for instance [2, 17, 23, 38]. 

The problem of tuning the regularization parameters (or equivalently the hyper- 
parameters describing the prior in a Bayesian setting) has received relatively little 
attention in the econometrics literature: [40] already suggested the use of Empirical 
Bayes procedure, while [2, 18] propose tuning the hyperparameters using out-of- 
sample and in-sample errors, respectively. The paper [38] and the most recent work 
[23] adopt again an Empirical Bayes approach using the marginal likelihood; [23] 
claims the superiority of this approach w.r.t. previous “ad-hoc” techniques [2, 18]. 

Despite this long history, the use of Bayesian priors for system identification 
has only gained popularity in relatively recent times, e.g., see the survey [52]. We 
believe it is fair to attribute this to the fact that much more
effort has recently been devoted to developing prior models tailored to estimating
dynamical systems. In the remaining part of the book, these issues will be dealt with
in some detail. The reader is referred to [10, 11, 49, 50, 55] for various classes of
prior models and to [6, 7, 12, 55] for more details on Maximum Entropy derivations. 
Extensions include prior models to estimate sparse models for high-dimensional
time series [14, 74] as well as classes of priors for nonlinear dynamical models [51],
which will be thoroughly discussed in Chap. 8. In particular, the techniques described
in this chapter can also be used to identify the so-called dynamic networks, which
consist of a large set of interconnected dynamic systems. Modelling such complex
physical systems is important in several fields of science and engineering, including
biomedicine and neuroscience [27, 30, 46, 56]. Estimation is difficult since
such networks are often large scale and their topology is typically unknown [14, 44,
67]. One typically postulates the existence of many connections and then has to
infer from data which ones are really active. Since in real physical systems often
only a small fraction of the links is active, the estimation process needs to exploit
sparsity regularizers such as those introduced in Chap. 3 and their stochastic interpretations,
like the Bayesian Lasso [47]. In the context of linear dynamic networks, where
modules are defined by impulse responses, many approaches have been recently
designed, e.g., relying on local multi-input single-output (MISO) models [16, 19,
45]. Contributions based on variational Bayesian inference and/or nonparametric
regularization, deeply connected with the techniques discussed in this book, are
in [14, 33, 58, 73]. Methods to infer the full network dynamics using (structured)
multiple-input multiple-output (MIMO) models can be found instead in [21, 69], with
the consistency of the estimates analyzed in [57]. A contribution based on the combination of
the stable spline kernel and the so-called horseshoe sparsity prior [8, 54, 68] has 
been developed in [48]. See also [3, 24, 29, 70] for insights on identifiability issues 
and [28] where compressed sensing is exploited. 


5.10 Appendix 
5.10.1 Optimal Kernel 


Theorem 5.7 The solution $P^*$ of problem (5.20) is given by
$$P^* = \theta_0\theta_0^T, \tag{5.71}$$
where $\theta_0$ is the “true” impulse response of the data-generating mechanism (5.14).


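Proof The proof will proceed as follows: let us denote with $\hat\theta^{P^*}$ the estimator
obtained with $P$ as in (5.71). Consider the error

$$\tilde\theta^{P} := \hat\theta^{P} - \theta_0$$

which can be written as

$$\tilde\theta^{P} = \hat\theta^{P} - \theta_0 + \hat\theta^{P^*} - \hat\theta^{P^*} = \tilde\theta^{P^*} + (\hat\theta^{P} - \hat\theta^{P^*}).$$

We shall show that the following orthogonality property holds:

$$\mathscr{E}\,\tilde\theta^{P^*}(\hat\theta^{P} - \hat\theta^{P^*})^T = 0 \tag{5.72}$$

so that

$$\mathscr{E}\,\tilde\theta^{P}(\tilde\theta^{P})^T = \mathscr{E}\,\tilde\theta^{P^*}(\tilde\theta^{P^*})^T + \mathscr{E}\,(\hat\theta^{P} - \hat\theta^{P^*})(\hat\theta^{P} - \hat\theta^{P^*})^T \tag{5.73}$$

and therefore:

$$M_{\theta_0}(P) - M_{\theta_0}(P^*) = \mathscr{E}\,\tilde\theta^{P}(\tilde\theta^{P})^T - \mathscr{E}\,\tilde\theta^{P^*}(\tilde\theta^{P^*})^T = \mathscr{E}\,(\hat\theta^{P} - \hat\theta^{P^*})(\hat\theta^{P} - \hat\theta^{P^*})^T \succeq 0$$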


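which will prove the claim that $P^* = \theta_0\theta_0^T$ is the optimal solution to (5.20).
It now just remains to show that (5.72) holds. To do so, let us rewrite (4.7) assuming
null $\mu_\theta$ and using the matrix inversion lemma as in (3.145):

$$\begin{aligned}
\hat\theta^{P} &= (\sigma^2 I + P\Phi^T\Phi)^{-1}P\Phi^T Y\\
&= (\sigma^2 I + P\Phi^T\Phi)^{-1}P\Phi^T(\Phi\theta_0 + E)\\
&= (\sigma^2 I + P\Phi^T\Phi)^{-1}\left[(P\Phi^T\Phi + \sigma^2 I - \sigma^2 I)\theta_0 + P\Phi^T E\right]\\
&= \theta_0 - (\sigma^2 I + P\Phi^T\Phi)^{-1}\left[\sigma^2\theta_0 - P\Phi^T E\right].
\end{aligned}$$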


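Therefore, the error $\tilde\theta^{P} := \theta_0 - \hat\theta^{P}$ (the sign convention is immaterial for the
second moments appearing in (5.72) and (5.73)) can be written in the form:

$$\tilde\theta^{P} = \underbrace{(\sigma^2 I + P\Phi^T\Phi)^{-1}}_{:=W_P}\left[\sigma^2\theta_0 - P\Phi^T E\right] = W_P\left[\sigma^2\theta_0 - P\Phi^T E\right]. \tag{5.74}$$

Now, using (5.74), we have:
$$\hat\theta^{P^*} - \hat\theta^{P} = \tilde\theta^{P} - \tilde\theta^{P^*} = \sigma^2(W_P - W_{P^*})\theta_0 + (W_{P^*}P^* - W_P P)\Phi^T E.$$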


Now, let us compute 


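$$\begin{aligned}
\mathscr{E}\,(\hat\theta^{P^*} - \hat\theta^{P})(\tilde\theta^{P^*})^T &= \sigma^4(W_P - W_{P^*})\theta_0\theta_0^T W_{P^*}^T - \sigma^2(W_{P^*}P^* - W_P P)\Phi^T\Phi P^* W_{P^*}^T\\
&= \sigma^2\left[\sigma^2(W_P - W_{P^*}) - (W_{P^*}P^* - W_P P)\Phi^T\Phi\right]P^* W_{P^*}^T.
\end{aligned} \tag{5.75}$$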
If we now use the identity

$$W_P(\sigma^2 I + P\Phi^T\Phi) = I \;\Longrightarrow\; \sigma^2 W_P = I - W_P P\Phi^T\Phi$$

we obtain
$$\sigma^2(W_P - W_{P^*}) = (W_{P^*}P^* - W_P P)\Phi^T\Phi$$

so that, using (5.75),
$$\mathscr{E}\,(\hat\theta^{P^*} - \hat\theta^{P})(\tilde\theta^{P^*})^T = 0$$


which proves (5.72) and thus the theorem. 
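
The statement of Theorem 5.7 can be checked numerically on a small example. The sketch below is only illustrative: the regression matrix `Phi`, the vector `theta0`, the noise variance `sigma2` and the white-noise assumption on $E$ are choices made here, not taken from the text. It evaluates the second-moment matrix of the error through (5.74) and verifies that the difference with respect to the choice $P^* = \theta_0\theta_0^T$ is positive semidefinite for several randomly drawn $P$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, sigma2 = 4, 30, 0.5            # illustrative sizes and noise variance
Phi = rng.standard_normal((N, n))    # hypothetical regression matrix
theta0 = rng.standard_normal(n)      # hypothetical "true" impulse response
R = Phi.T @ Phi

def mse_matrix(P):
    """Second moment of theta0 - theta_hat^P, using (5.74) with white noise E."""
    W = np.linalg.inv(sigma2 * np.eye(n) + P @ R)   # W_P
    bias = sigma2 * W @ theta0                      # deterministic part of the error
    noise_cov = sigma2 * W @ P @ R @ P.T @ W.T      # covariance of W_P P Phi^T E
    return np.outer(bias, bias) + noise_cov

M_star = mse_matrix(np.outer(theta0, theta0))       # P* = theta0 theta0^T

min_eigs = []
for _ in range(20):
    A = rng.standard_normal((n, n))
    P = A @ A.T                                     # random positive semidefinite P
    diff = mse_matrix(P) - M_star
    min_eigs.append(np.linalg.eigvalsh((diff + diff.T) / 2).min())

worst = min(min_eigs)
print("smallest eigenvalue of M(P) - M(P*):", worst)
```

Up to rounding errors, the smallest eigenvalue should never be negative, in agreement with the ordering proved above.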


5.10.2 Proof of Lemma 5.1 


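Consider the following upper bound on the probability that the $\ell_1$ norm of $\theta$ be larger
than a given threshold $T_{\ell_1}$:

$$\mathbb{P}\left[\sum_{t=1}^{\infty}|\theta_t| > T_{\ell_1}\right] \le \frac{1}{T_{\ell_1}}\,\mathscr{E}\sum_{t=1}^{\infty}|\theta_t| = \frac{1}{T_{\ell_1}}\sum_{t=1}^{\infty}\mathscr{E}|\theta_t| \le \frac{1}{T_{\ell_1}}\sum_{t=1}^{\infty}\left(|m_t| + \sqrt{2/\pi}\,K^{1/2}(t,t)\right)$$

where we have used the equality $\mathscr{E}|X| = \sigma\sqrt{2/\pi}$ for $X \sim \mathcal{N}(0, \sigma^2)$. Using the
hypothesis (5.24) we have that

$$\mathbb{P}\left[\sum_{t=1}^{\infty}|\theta_t| > T_{\ell_1}\right] \le \frac{M_{\ell_1} + K_{\ell_1}\sqrt{2/\pi}}{T_{\ell_1}}$$

and therefore

$$\mathbb{P}\left[\sum_{t=1}^{\infty}|\theta_t| \le T_{\ell_1}\right] \ge 1 - \frac{M_{\ell_1} + K_{\ell_1}\sqrt{2/\pi}}{T_{\ell_1}}.$$

Taking the limit as $T_{\ell_1} \to +\infty$ we have

$$\mathbb{P}\left[\sum_{t=1}^{\infty}|\theta_t| < +\infty\right] = 1$$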


which concludes the proof. 


5.10.3 Proof of Theorem 5.5 


The proof is based on the fact that the Maximum Entropy distribution $p(\theta)$ under the
constraints $\mathscr{E} f_k(\theta) = F_k$ and $\mathscr{E} g_k(\theta) = G_k$ has the “Gibbs” structure, i.e., it is the
exponential of a weighted sum of the constraint functionals (see e.g., [15]):
$$p(\theta) \propto e^{-\sum_i \left(\mu_i f_i(\theta) + \gamma_i g_i(\theta)\right)}.$$

In our case, we have $f_k(\theta) = \theta_k^2$ and $g_k(\theta) = (\theta_{k-1} - \theta_k)^2$, and therefore the max-ent
solution has the form

$$p_{\theta,ME}(\theta) = C\, e^{-\left(\mu_0\theta_0^2 + \sum_{k=1}^{n}\mu_k\theta_k^2 + \gamma_k(\theta_{k-1} - \theta_k)^2\right)}. \tag{5.76}$$

Using a well-known result in graphical models (see e.g., Lauritzen [39]), the variables
$\theta_k$ and $\{\theta_{k+2}, \ldots, \theta_n\}$ are conditionally independent given $\theta_{k+1}$, because $\theta_{k+1}$ is the
only neighbour of $\theta_k$ in the graph representing $p(\theta_k, \theta_{k+1}, \ldots, \theta_n)$ (or equivalently
$\theta_{k+1}$ separates $\theta_k$ from $\theta_{k+2}, \theta_{k+3}, \ldots, \theta_n$).

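In our case, this conditional independence implies that the best linear estimator
$\hat\theta_{k-1}$ of $\theta_{k-1}$ given $\theta_k, \theta_{k+1}, \ldots, \theta_n$ depends only on $\theta_k$ (i.e., $\hat\theta_{k-1} = a_{B,k}\theta_k$) so that the
vector $\theta$ admits the backward AR representation:²

$$\theta_{k-1} = a_{B,k}\theta_k + w_k \tag{5.77}$$

with $w_k := \theta_{k-1} - \hat\theta_{k-1} = \theta_{k-1} - a_{B,k}\theta_k$ zero mean and uncorrelated with $\theta_k,
\theta_{k+1}, \ldots, \theta_n$. Let us define $\sigma_k^2 = \mathscr{E} w_k^2$. In order to express $a_{B,k}$ and $\sigma_k^2$ as a function
of $\lambda_R$, $\lambda_S$, $\alpha$, we exploit the constraints (5.31) and the dynamical model (5.77). In
particular we have

$$\begin{aligned}
\lambda_S\alpha^{k-1} &= \mathscr{E}\theta_{k-1}^2\\
&= a_{B,k}^2\,\mathscr{E}\theta_k^2 + \sigma_k^2\\
&= a_{B,k}^2\lambda_S\alpha^{k} + \sigma_k^2
\end{aligned} \tag{5.78}$$

$$\begin{aligned}
\lambda_R\alpha^{k-1} &= \mathscr{E}(\theta_{k-1} - \theta_k)^2\\
&= \mathscr{E}\left((a_{B,k} - 1)\theta_k + w_k\right)^2\\
&= \mathscr{E}(a_{B,k} - 1)^2\theta_k^2 + \mathscr{E} w_k^2\\
&= (a_{B,k} - 1)^2\lambda_S\alpha^{k} + \sigma_k^2
\end{aligned} \tag{5.79}$$

Subtracting (5.79) from (5.78) we obtain
$$(\lambda_S - \lambda_R)\alpha^{k-1} = a_{B,k}^2\lambda_S\alpha^{k} - (a_{B,k} - 1)^2\lambda_S\alpha^{k} = (2a_{B,k} - 1)\lambda_S\alpha^{k}$$

which implies that

$$a_{B,k} = \frac{\lambda_S(1 + \alpha) - \lambda_R}{2\lambda_S\alpha} =: a_B$$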


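which is independent of $k$ and is thus denoted by $a_B$ as in (5.35).

² We prefer here to work with backward representations since, as we will see, with this choice we
will have $a_{B,k} = a_B$, independent of $k$. Forward representations are discussed in Sect. 5.10.8.

From (5.79) we also have that
$$\sigma_k^2 = \lambda_R\alpha^{k-1} - (a_B - 1)^2\lambda_S\alpha^{k} = \left(\lambda_R - (a_B - 1)^2\lambda_S\alpha\right)\alpha^{k-1} = (1 - a_B^2\alpha)\lambda_S\alpha^{k-1}$$
where the last equality follows after a few manipulations and proves (5.36). Replacing

$$a_B - 1 = \frac{\lambda_S(1 - \alpha) - \lambda_R}{2\lambda_S\alpha}$$

in the previous equation we have:

$$\sigma_k^2 = \left(\lambda_R - \left(\frac{\lambda_S(1 - \alpha) - \lambda_R}{2\lambda_S\alpha}\right)^2\lambda_S\alpha\right)\alpha^{k-1}.$$

Of course $\sigma_k^2$, and thus the right hand side, should be positive (for simplicity we
exclude the singular case $\sigma_k^2 = 0$):

$$0 < \lambda_R - \left(\frac{\lambda_S(1 - \alpha) - \lambda_R}{2\lambda_S\alpha}\right)^2\lambda_S\alpha = \frac{4\lambda_R\lambda_S\alpha - (\lambda_S(1 - \alpha) - \lambda_R)^2}{4\lambda_S\alpha}$$

which in turn is equivalent to
$$4\lambda_R\lambda_S\alpha - (\lambda_S(1 - \alpha) - \lambda_R)^2 > 0.$$
This happens if and only if
$$\lambda_R^2 - 2\lambda_R\lambda_S(1 + \alpha) + (1 - \alpha)^2\lambda_S^2 < 0.$$

This is a degree two polynomial in $\lambda_R$ with two positive roots

$$\lambda_{R,i} = \lambda_S(1 + \alpha) \pm \sqrt{\lambda_S^2(1 + \alpha)^2 - \lambda_S^2(1 - \alpha)^2} = \lambda_S(1 + \alpha \pm 2\sqrt{\alpha}), \quad i = 1, 2$$
and therefore our problem is feasible if and only if

$$\lambda_{R,\min} = \lambda_{R,1} = \lambda_S(1 + \alpha - 2\sqrt{\alpha}) < \lambda_R < \lambda_S(1 + \alpha + 2\sqrt{\alpha}) = \lambda_{R,2} = \lambda_{R,\max}$$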
thus proving (5.32). Now it remains to prove that (5.76) takes the form (5.34). First
let us observe that the exponent of (5.76) is a quadratic form in $\theta$, and therefore
(5.76) can be written in the form

$$p_{\theta,ME}(\theta) = C\, e^{-\frac{1}{2}\theta^T\Omega\,\theta}.$$


Last, since in (5.76) only products of the form $\theta_h\theta_k$ for $h \in \{k - 1, k, k + 1\}$ appear,
the matrix $\Omega = \Omega^T$ has a band (tridiagonal) structure.
In addition, for $p_{\theta,ME}(\theta)$ to be a density, $\Omega$ needs to be positive semidefinite (otherwise
there would be directions in which the density grows indefinitely). Since $\theta$
admits the backward AR representation (5.77) with $\mathscr{E} w_k^2 = \sigma_k^2 > 0$, the covariance
matrix $\Sigma = \mathscr{E}\theta\theta^T$ is positive definite and thus $\Omega = \Sigma^{-1}$. To compute the autocovariance
function $\mathscr{E}\theta_h\theta_k$, we consider the following cases: if $k = h$ we have

$$\mathscr{E}\theta_h\theta_h = \lambda_S\alpha^{h}.$$

If $k > h$ we have
$$\mathscr{E}\theta_h\theta_k = a_B\,\mathscr{E}\theta_{h+1}\theta_k$$

and iterating the relation we find
$$\mathscr{E}\theta_h\theta_k = a_B^{k-h}\,\mathscr{E}\theta_k\theta_k = \lambda_S a_B^{k-h}\alpha^{k}.$$
Analogously, if $h > k$ we have
$$\mathscr{E}\theta_h\theta_k = a_B^{h-k}\,\mathscr{E}\theta_h\theta_h = \lambda_S a_B^{h-k}\alpha^{h}.$$
Combining the three cases we obtain
$$\mathscr{E}\theta_h\theta_k = \lambda_S a_B^{|k-h|}\alpha^{\max\{k,h\}}$$

proving (5.38). 
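
The relations derived in this proof can be verified numerically. The following sketch uses illustrative values of $\lambda_S$, $\lambda_R$, $\alpha$ and $n$ (assumptions made here, not values from the text) and the index range $k = 0, \ldots, n$. It builds the covariance matrix from the closed form $\lambda_S a_B^{|k-h|}\alpha^{\max\{k,h\}}$, then checks the feasibility interval for $\lambda_R$, the two moment constraints used in (5.78) and (5.79), and the tridiagonal structure of the inverse covariance.

```python
import numpy as np

lam_S, alpha, n = 2.0, 0.8, 8           # illustrative hyperparameters
lam_R = 0.7                             # chosen inside the feasible interval

# Feasibility interval for lam_R derived in the proof
lam_R_min = lam_S * (1 + alpha - 2 * np.sqrt(alpha))
lam_R_max = lam_S * (1 + alpha + 2 * np.sqrt(alpha))
feasible = lam_R_min < lam_R < lam_R_max

# Backward AR coefficient a_B
a_B = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * alpha)

# Covariance matrix with entries lam_S * a_B^|k-h| * alpha^max(k,h)
k = np.arange(n + 1)
lag = np.abs(k[:, None] - k[None, :])
Sigma = lam_S * a_B ** lag * alpha ** np.maximum(k[:, None], k[None, :])

# Constraint on the variances: E theta_k^2 = lam_S * alpha^k
var_ok = np.allclose(np.diag(Sigma), lam_S * alpha ** k)

# Constraint on the increments: E (theta_{k-1} - theta_k)^2 = lam_R * alpha^(k-1)
d = np.diag(Sigma)
incr_var = d[:-1] + d[1:] - 2 * np.diag(Sigma, 1)
incr_ok = np.allclose(incr_var, lam_R * alpha ** k[:-1])

# The inverse covariance should vanish outside the three central diagonals
Omega = np.linalg.inv(Sigma)
tridiagonal = np.allclose(Omega[lag > 1], 0.0, atol=1e-8)

print(feasible, var_ok, incr_ok, tridiagonal)
```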


5.10.4 Proof of Corollary 5.1 


Using the definition (5.39) in Eq. (5.38) we obtain:

$$\mathscr{E}\theta_k\theta_h = \lambda_S a_B^{|k-h|}\alpha^{\max\{k,h\}} = \lambda_S\left(\frac{\rho}{\sqrt{\alpha}}\right)^{|k-h|}\alpha^{\max\{k,h\}} = \lambda_S\rho^{|k-h|}\alpha^{\frac{k+h}{2}}.$$

In addition, if the matching condition $\lambda_R = \lambda_S(1 - \alpha)$ is satisfied, then from (5.35)
$a_B = 1$ and from (5.39) $\rho = \sqrt{\alpha}$; substituting in (5.40) we obtain
$$\mathscr{E}\theta_k\theta_h = \lambda_S\rho^{|k-h|}\alpha^{\frac{k+h}{2}} = \lambda_S\alpha^{\max\{k,h\}}$$


i.e., the covariance sequence of the well known TC kernel. 
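
As a quick numerical illustration of the corollary, the sketch below (with illustrative values of $\lambda_S$, $\alpha$ and of the matrix size, all assumptions made here) imposes the matching condition and compares the resulting covariance with the TC form $\lambda_S\alpha^{\max\{k,h\}}$.

```python
import numpy as np

lam_S, alpha, n = 1.5, 0.9, 6                 # illustrative values
lam_R = lam_S * (1 - alpha)                   # matching condition

a_B = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * alpha)   # equals 1 here
rho = a_B * np.sqrt(alpha)                    # equals sqrt(alpha) here

k = np.arange(n)
kk, hh = k[:, None], k[None, :]

# Covariance written as lam_S * rho^|k-h| * alpha^((k+h)/2)
K_rho = lam_S * rho ** np.abs(kk - hh) * alpha ** ((kk + hh) / 2)

# TC kernel: lam_S * alpha^max(k,h)
K_tc = lam_S * alpha ** np.maximum(kk, hh)

print(np.isclose(a_B, 1.0), np.allclose(K_rho, K_tc))
```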


5.10.5 Proof of Lemma 5.2 


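The proof of this lemma is a simple application of the Cauchy–Schwarz inequality. In particular
we have:

$$\begin{aligned}
|u_{i,t}| = \frac{1}{\zeta_i}\left|\sum_{s=1}^{n}[K]_{ts}\,u_{i,s}\right| &\le \frac{1}{\zeta_i}\sum_{s=1}^{n}\left|[K]_{ts}\right||u_{i,s}|\\
&\le \frac{\sqrt{[K]_{tt}}}{\zeta_i}\sum_{s=1}^{n}\sqrt{[K]_{ss}}\,|u_{i,s}| \le \frac{\sqrt{[K]_{tt}}}{\zeta_i}\sqrt{\sum_{s=1}^{n}[K]_{ss}}\,\sqrt{\sum_{s=1}^{n}|u_{i,s}|^2},
\end{aligned}$$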


where the last inequality follows from the fact that $u_i$ has 2-norm equal to 1 for
all $i$. The same condition clearly holds also in the infinite dimensional case, i.e., as
$n \to \infty$ if $K(t, s)$ admits the spectral decomposition

$$K(t, s) = \sum_{i=1}^{\infty}\zeta_i\,u_{i,t}\,u_{i,s}$$

and the condition $\sum_t K(t, t) = C < \infty$ holds. In particular this latter condition holds
true if the more stringent condition $\sum_t K^{1/2}(t, t) < \infty$ in Lemma 5.1 is satisfied.


5.10.6 Proof of Theorem 5.6 


The proof follows from the fact that the Maximum Entropy distribution $p(x)$ under the
constraints $\mathscr{E} f_i(x) \le \gamma_i$ has the “Gibbs” structure, i.e., it is the exponential of a
weighted sum of the constraint functionals (see e.g., [15]):

$$p(x) \propto e^{-\sum_i \mu_i f_i(x)}.$$


5.10.7 Proof of Lemma 5.5 


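Since $\{g_k\}_{k=0,\ldots,\infty}$ is zero mean, then clearly also $G_\alpha(e^{j\omega})$ is so, i.e., $\mathscr{E}\,G_\alpha(e^{j\omega}) = 0$.
If we now consider the difference
$$G_\alpha(e^{j\omega_1}) - G_\alpha(e^{j\omega_2}) = \sum_{k=0}^{\infty} g_k\left[e^{-j\omega_1 k} - e^{-j\omega_2 k}\right];$$

taking the expected value of the squared norm, and using the fact that $\mathscr{E}\,g_k g_h =
c\,\alpha^k\delta_{k-h}$, we have

$$\mathscr{E}\left|G_\alpha(e^{j\omega_1}) - G_\alpha(e^{j\omega_2})\right|^2 = \sum_{k=0}^{\infty} c\,\alpha^k\left|e^{-j\omega_1 k} - e^{-j\omega_2 k}\right|^2.$$

Now, using
$$\left|e^{-j\omega_1 k} - e^{-j\omega_2 k}\right|^2 = 2\left(1 - \cos((\omega_1 - \omega_2)k)\right) \le (\omega_1 - \omega_2)^2 k^2$$

and the expression for the sum of the geometric series $\alpha^k$ the thesis follows.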


5.10.8 Forward Representations of Stable-Splines Kernels ⋆


A major drawback of the backward construction is that it is not straightforward to
extend it to an infinite interval, i.e., to let $n \to \infty$ in order to consider infinitely
long impulse response models $\{\theta_k\}_{k\in\mathbb{N}}$. However this difficulty can be circumvented by
exploiting the “forward” representation of (5.77), which turns out to be again a
time-varying AR(1) model.³ Theorem 5.8 derives the forward AR(1) representation of the
maximum entropy process found in Theorem 5.5.


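Theorem 5.8 The maximum entropy solution to (5.33) found in Theorem 5.5 admits
the forward AR(1) representation

$$\theta_{k+1} = a_F\theta_k + w_k, \quad k \ge 0 \tag{5.80}$$
with zero-mean initial condition such that $\mathscr{E}\theta_0^2 = \lambda_S$, and where

$$a_F = \rho\,\alpha^{1/2} = a_B\alpha \tag{5.81}$$

and $w_k$ is a sequence of zero mean variables, uncorrelated with the initial condition
$\theta_0$ and such that

$$\mathscr{E}\,w_k w_h = \sigma_{F,k}^2\,\delta_{k-h} \tag{5.82}$$

with $\sigma_{F,k}^2 = \lambda_S\alpha^{k+1}(1 - \rho^2)$.

³ There are several ways to see this: perhaps the simplest is to recall that the inverse covariance
matrix of an AR(1) process has a band (tridiagonal) structure, which implies that forward and
backward models share the same conditional dependence structure.

Proof First of all let us observe that, if $\theta_k$ admits an AR(1) forward representation
of the form (5.80) (with $w_k$ that satisfies (5.82)), $a_F$ should satisfy the relation

$$a_F = \mathscr{E}\theta_{k+1}\theta_k\left(\mathscr{E}\theta_k^2\right)^{-1}.$$
Using the expression (5.38), we obtain:

$$\mathscr{E}\theta_{k+1}\theta_k\left(\mathscr{E}\theta_k^2\right)^{-1} = \lambda_S a_B\alpha^{k+1}\left(\lambda_S\alpha^{k}\right)^{-1} = a_B\alpha$$

and recalling that $\rho = a_B\alpha^{1/2}$ we also obtain

$$a_F = a_B\alpha = \rho\,\alpha^{1/2}.$$
In addition, denoting $\sigma_{F,k}^2 := \mathscr{E} w_k^2$,
$$\mathscr{E}\theta_{k+1}^2 = a_F^2\,\mathscr{E}\theta_k^2 + \sigma_{F,k}^2$$
must hold. Therefore,
$$\sigma_{F,k}^2 = \mathscr{E}\theta_{k+1}^2 - a_F^2\,\mathscr{E}\theta_k^2 = \lambda_S\alpha^{k+1} - \rho^2\alpha\,\lambda_S\alpha^{k} = \lambda_S\alpha^{k+1}(1 - \rho^2).$$
It is also straightforward to verify that, if $\theta_k$ is generated by (5.80), then
$$\mathscr{E}\theta_{k+\tau}\theta_k = a_F^{\tau}\,\mathscr{E}\theta_k^2 = a_F^{\tau}\lambda_S\alpha^{k} = \lambda_S a_B^{\tau}\alpha^{k+\tau}, \quad \tau > 0$$
which is exactly of the form

$$\mathscr{E}\theta_h\theta_k = \lambda_S a_B^{|h-k|}\alpha^{\max\{h,k\}}$$

provided $h = k + \tau$, $\tau > 0$. This concludes the proof.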
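
The equivalence between the forward model and the covariance (5.38) can also be checked by propagating second moments through the recursion (5.80). The sketch below is only an illustration: the values of $\lambda_S$, $\alpha$, $a_B$ and the horizon `n` are arbitrary choices made here (with $\rho = a_B\sqrt{\alpha} < 1$ so that the noise variances are positive).

```python
import numpy as np

lam_S, alpha, a_B, n = 2.0, 0.8, 0.9, 7   # illustrative values, rho < 1
rho = a_B * np.sqrt(alpha)
a_F = a_B * alpha                          # forward coefficient (5.81)

k = np.arange(n + 1)
sig2_F = lam_S * alpha ** (k + 1) * (1 - rho ** 2)   # noise variances in (5.82)

# Propagate second moments of theta_{j+1} = a_F * theta_j + w_j
Sigma_fwd = np.zeros((n + 1, n + 1))
Sigma_fwd[0, 0] = lam_S                    # E theta_0^2 = lam_S
for j in range(n):
    # w_j is uncorrelated with theta_0..theta_j, so cross moments scale by a_F
    Sigma_fwd[j + 1, : j + 1] = a_F * Sigma_fwd[j, : j + 1]
    Sigma_fwd[: j + 1, j + 1] = Sigma_fwd[j + 1, : j + 1]
    Sigma_fwd[j + 1, j + 1] = a_F ** 2 * Sigma_fwd[j, j] + sig2_F[j]

# Closed form lam_S * a_B^|h-k| * alpha^max(h,k)
kk, hh = k[:, None], k[None, :]
Sigma_closed = lam_S * a_B ** np.abs(kk - hh) * alpha ** np.maximum(kk, hh)

print(np.allclose(Sigma_fwd, Sigma_closed))
```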
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Chapter 6 A) 
Regularization in Reproducing Kernel
Hilbert Spaces


Abstract Methods for obtaining a function g in a relationship y = g(x) from 
observed samples of y and x are the building blocks for black-box estimation. The 
classical parametric approach discussed in the previous chapters uses a function 
model that depends on a finite-dimensional vector, like, e.g., a polynomial model. 
We have seen that an important issue is the model order choice. This chapter describes 
some regularization approaches which permit to reconcile flexibility of the model 
class with well-posedness of the solution exploiting an alternative paradigm to tra- 
ditional parametric estimation. Instead of constraining the unknown function to a 
specific parametric structure, the function will be searched over a possibly infinite- 
dimensional functional space. Overfitting and ill-posedness are circumvented by 
using reproducing kernel Hilbert spaces as hypothesis spaces and related norms as 
regularizers. Such kernel-based approaches thus permit to cast all the regularized esti- 
mators based on quadratic penalties encountered in the previous chapters as special 
cases of a more general theory. 


6.1 Preliminaries 


Techniques for reconstructing a function $g$ in a functional relationship $y = g(x)$
from observed samples of $y$ and $x$ are the fundamental building blocks for black-box
estimation. As already seen in Chap. 3 when treating linear regression, given a finite
set of pairs $(x_i, y_i)$ the aim is to determine a function $g$ having a good prediction
capability, i.e., for a new pair $(x, y)$ we would like the prediction $g(x)$ to be close to $y$
(e.g., in the MSE sense).

The classical parametric approach discussed in Chap. 3 uses a model $g_\theta$ that
depends on a finite-dimensional vector $\theta$. A very simple example is a polynomial
model, treated in Example 3.1, given, e.g., by $g_\theta(x) = \theta_1 + \theta_2 x + \theta_3 x^2$ whose coefficients
$\theta_i$ can be estimated by fitting the data via least squares. In this parametric
scenario, we have seen that an important issue is the model order choice. In fact, the
least squares objective improves as the dimension of $\theta$ increases, eventually leading
to data interpolation. But overparametrized models, as a rule, perform poorly when
used to predict future output data, even if benign overfitting may sometimes happen,
as, e.g., described in the context of deep networks [17, 55, 75]. Another drawback
related to overparametrization is that the problem may become ill-posed in the sense
of Hadamard, i.e., the solution may be non-unique, or ill-conditioned. This means
that the estimate may be highly sensitive even to small perturbations of the outputs
$y_i$ as, e.g., illustrated in Fig. 1.3 of Sect. 1.2.
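
The effect of the model order on the least squares objective can be illustrated with a small experiment. The sketch below is only indicative: the data-generating function, the noise level and the number of samples are arbitrary choices made here. It fits polynomials of increasing order and records the residual sum of squares, which can only decrease and reaches (numerically) zero when the number of coefficients equals the number of data points.

```python
import numpy as np

rng = np.random.default_rng(1)
n_data = 8
x = np.linspace(-1.0, 1.0, n_data)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(n_data)   # hypothetical noisy data

residuals = []
for order in range(n_data):
    # Regression matrix with columns 1, x, ..., x^order
    V = np.vander(x, order + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(V, y, rcond=None)
    residuals.append(float(np.sum((y - V @ theta) ** 2)))

for order, r in enumerate(residuals):
    print(f"order {order}: residual sum of squares = {r:.3e}")
```

The last model interpolates the data exactly, which says nothing about its ability to predict new outputs.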

This chapter describes some regularization approaches which permit to recon- 
cile flexibility of the model class with well-posedness of the solution exploiting an 
alternative paradigm to traditional parametric estimation. Instead of constraining the 
unknown function to a specific parametric structure, g will be searched over a possibly 
infinite-dimensional functional space. Overfitting and ill-posedness are circumvented
by using reproducing kernel Hilbert spaces (RKHSs) as hypothesis spaces and related 
norms as regularizers. Such norms generalize the quadratic penalties seen in Chap. 
3. In this scenario, the estimator is completely defined by a positive definite kernel 
which has to encode the expected function properties, e.g., the smoothness level. 
Furthermore we will see that, even when the model class is infinite dimensional, the 
function estimate turns out to be a finite linear combination of basis functions computable
from the kernel. The estimator also enjoys strong asymptotic properties, permitting 
(under reasonable assumptions on data generation) to achieve the optimal predictor 
as the data set size grows to infinity. 

The kernel-based approaches described in the following sections thus permit to 
cast all the regularized estimators based on quadratic penalties encountered in the 
previous chapters as special cases of a more general theory. In addition, RKHS theory 
paves the way to the development of other powerful techniques, e.g., for estimation 
of an infinite number of impulse response coefficients (IIR models estimation), for 
continuous-time linear system identification and also for nonlinear system identifi- 
cation. 

The reader not familiar with functional analysis will find in the first part of the
appendix of this chapter a brief overview of the basic results used in the next sections,
like, e.g., the concept of linear and bounded functional, which is key to defining
a RKHS.


6.2 Reproducing Kernel Hilbert Spaces 


In what follows, we use $\mathcal{X}$ to indicate domains of functions. In machine learning,
this set is often referred to as the input space with its generic element $x \in \mathcal{X}$ called
input location. Sometimes, $\mathcal{X}$ is assumed to be a compact metric space, e.g., one
can think of $\mathcal{X}$ as a closed and bounded set in the familiar space $\mathbb{R}^m$ equipped
with the Euclidean norm. In what follows, all the functions are real valued, so that

$$f : \mathcal{X} \to \mathbb{R}.$$

Reproducing kernel Hilbert spaces We now introduce a class of Hilbert spaces $\mathscr{H}$
which play a fundamental role as hypothesis spaces for function estimation problems.
Our goal is to estimate maps which permit to make predictions over the whole $\mathcal{X}$.
Thus, a basic requirement is to search for the predictor in a space containing functions
which are well-defined pointwise for any $x \in \mathcal{X}$. In particular, we assume that all
the pointwise evaluators $g \mapsto g(x)$ are linear and bounded over $\mathscr{H}$. This means that
$\forall x \in \mathcal{X}$ there exists $C_x < \infty$ such that

$$|g(x)| \le C_x\|g\|_{\mathscr{H}}, \quad \forall g \in \mathscr{H}. \tag{6.1}$$

The above condition is stronger than requiring $g(x) < \infty\ \forall x$ since $C_x$ can depend
on $x$ but not on $g$. This property already leads to the function spaces of interest. The
following definitions are taken from [13].


Definition 6.1 (RKHS, based on [13]) A reproducing kernel Hilbert space (RKHS)
over a non-empty set $\mathcal{X}$ is a Hilbert space of functions $g : \mathcal{X} \to \mathbb{R}$ such that (6.1)
holds.


As suggested by the name itself, RKHSs are related to the concept of positive
definite kernel [13, 20], a particular function defined over $\mathcal{X} \times \mathcal{X}$. In the literature
it is also called positive semidefinite kernel, hence in what follows positive definite
kernel and positive semidefinite kernel will define the same mathematical object.
This is also specified in the next definition.


Definition 6.2 (Positive definite kernel, Mercer kernel and kernel section, based on
[13]) Let $\mathcal{X}$ denote a non-empty set. A symmetric function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is
called positive definite kernel or positive semidefinite kernel if, for any finite natural
number $p$, it holds

$$\sum_{i=1}^{p}\sum_{j=1}^{p} a_i a_j K(x_i, x_j) \ge 0, \quad \forall (x_k, a_k) \in (\mathcal{X}, \mathbb{R}), \quad k = 1, \ldots, p.$$

If strict inequality holds for any set of $p$ distinct input locations $x_k$ and any
coefficients $a_k$ not all equal to zero, i.e.,

$$\sum_{i=1}^{p}\sum_{j=1}^{p} a_i a_j K(x_i, x_j) > 0,$$

then the kernel is strictly positive definite.
If $\mathcal{X}$ is a metric space and the positive definite kernel is also continuous, then $K$
is said to be a Mercer kernel.
Finally, given a kernel $K$, the kernel section $K_x$ centred at $x$ is the function
$\mathcal{X} \to \mathbb{R}$ defined by
$$K_x(y) = K(x, y) \quad \forall y \in \mathcal{X}.$$
Hence, in the sense given above, a positive definite kernel “contains” matrices which
are all at least positive semidefinite.
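
The definition can be made concrete by building the matrix with entries $K(x_i, x_j)$ for a finite set of input locations and checking that the associated quadratic form is nonnegative. The sketch below uses the Gaussian kernel $\exp(-\|x - y\|^2)$ and randomly drawn input locations and coefficients; these are illustrative choices made here.

```python
import numpy as np

def gaussian_kernel(x, y):
    """Gaussian kernel exp(-||x - y||^2)."""
    return np.exp(-np.sum((x - y) ** 2))

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 2))          # p = 6 hypothetical input locations in R^2

# Matrix with (i, j)-entry K(x_i, x_j)
G = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

a = rng.standard_normal(6)               # arbitrary coefficients a_1, ..., a_p
quad = a @ G @ a                         # sum_i sum_j a_i a_j K(x_i, x_j)
min_eig = np.linalg.eigvalsh(G).min()

print("quadratic form:", quad)
print("smallest eigenvalue:", min_eig)
```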

We are now in a position to state a fundamental theorem from [13] here specialized 
to Mercer kernels which lead to RKHSs containing continuous functions (the proof 
is reported in Sect. 6.9.2). 


Theorem 6.1 (RKHSs induced by Mercer kernels, based on [13]) Let $\mathcal{X}$ be a compact
metric space and let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel. Then, there exists
a unique (up to isometries) Hilbert space $\mathscr{H}$ of functions, called RKHS associated
to $K$, such that


1. all the kernel sections belong to $\mathscr{H}$, i.e.,
$$K_x \in \mathscr{H} \quad \forall x \in \mathcal{X}; \tag{6.2}$$
2. the so-called reproducing property holds, i.e.,
$$\langle K_x, g\rangle_{\mathscr{H}} = g(x) \quad \forall (x, g) \in (\mathcal{X}, \mathscr{H}). \tag{6.3}$$


In addition, $\mathscr{H}$ is contained in the space $\mathcal{C}$ of continuous functions.


Remark 6.1 Note that the space $\mathscr{H}$ characterized in Theorem 6.1 is indeed a RKHS
according to Definition 6.1. In fact, for any input location $x$ the kernel section $K_x$
belongs to the space and, according to the reproducing property, represents the evaluation
functional at $x$. Then, Theorem 6.27 (Riesz representation theorem), reported in
the appendix to this chapter, permits the conclusion that all the pointwise evaluators
over $\mathscr{H}$ are linear and bounded.


While Theorem 6.1 establishes a link between Mercer kernels (which enjoy conti- 
nuity properties) and RKHSs, it is possible also to state a one-to-one correspondence 
with the entire class of positive definite kernels (not necessarily continuous). In par- 
ticular, the following result holds. 


Theorem 6.2 (Moore–Aronszajn, based on [13]) Let $\mathcal{X}$ be any non-empty set. Then,
to every RKHS $\mathscr{H}$ there corresponds a unique positive definite kernel $K$ such that
the reproducing property (6.3) holds. Conversely, given a positive definite kernel $K$,
there exists a unique RKHS of real-valued functions defined over $\mathcal{X}$ where (6.2) and
(6.3) hold.


The proof can be quite easily obtained using Theorem 6.27 (Riesz representation 
theorem) and arguments similar to those contained in the proof of Theorem 6.1. 


Further notes and RKHSs examples Thus, a RKHS $\mathscr{H}$ can be defined just by
specifying a kernel $K$, also called the reproducing kernel of $\mathscr{H}$. In particular, any
RKHS is generated by the kernel sections. More specifically, let $S = \mathrm{span}(\{K_x\}_{x\in\mathcal{X}})$
and define the following norm in $S$
$$\|f\|_{\mathscr{H}}^2 = \sum_{i=1}^{p}\sum_{j=1}^{p} c_i c_j K(x_i, x_j), \tag{6.4}$$
where
$$f(\cdot) = \sum_{i=1}^{p} c_i K_{x_i}(\cdot).$$
Then, one has
$$\mathscr{H} = S \cup \{\text{all the limits w.r.t. } \|\cdot\|_{\mathscr{H}} \text{ of Cauchy sequences contained in } S\}.$$

Summarizing, one has

• all the kernel sections $K_x(\cdot)$ belong to the RKHS $\mathscr{H}$ induced by $K$;

• $\mathscr{H}$ contains also all the finite linear combinations of kernel sections along with
some particular infinite sums, convergent w.r.t. the norm (6.4);

• every $f \in \mathscr{H}$ is thus a linear combination of a possibly infinite number of kernel
sections.


Assume for instance $K(x_1, x_2) = \exp(-\|x_1 - x_2\|^2)$, which is the so-called
Gaussian kernel. Then, all the functions in the corresponding RKHS are sums,
or limits of sums, of functions proportional to Gaussians. As further elucidated later
on, this means that every function of $\mathscr{H}$ inherits properties such as smoothness and
integrability of the kernel, e.g., we have seen in Theorem 6.1 that kernel continuity
implies $\mathscr{H} \subset \mathcal{C}$. This fact has an important consequence on modelling: instead
of specifying a whole set of basis functions, it suffices to choose a single positive
definite kernel that encodes the desired properties of the function to be synthesized.


Example 6.3 (Norm in a two-dimensional RKHS) We introduce a very simple
RKHS to illustrate how the kernel $K$ can be seen as a similarity function that establishes
the norm (complexity) of a function by comparing function values at different
input locations.

When $\mathcal{X}$ has finite cardinality $m$, the functions are evaluated just on a finite
number of input locations. Hence, each function $f$ is in one-to-one correspondence
with the $m$-dimensional vector

$$[f(1)\ \ f(2)\ \ \ldots\ \ f(m)]^T.$$

In addition, any kernel is in one-to-one correspondence with one symmetric positive
semidefinite matrix $\mathbf{K} \in \mathbb{R}^{m\times m}$ with $(i, j)$-entry $\mathbf{K}_{ij} = K(i, j)$. Finally, the kernel
sections can be seen as the columns of $\mathbf{K}$.

Assume, e.g., $m = 2$ with $\mathcal{X} = \{1, 2\}$. Then, the functions can be seen as two-dimensional
vectors and any kernel $K$ is in one-to-one correspondence with one
symmetric positive semidefinite matrix $\mathbf{K} \in \mathbb{R}^{2\times 2}$. The RKHS $\mathscr{H}$ associated to $K$
is finite-dimensional being spanned just by the two kernel sections $K_1(\cdot)$ and $K_2(\cdot)$
which can be seen as the two columns of $\mathbf{K}$. Hence, the functions $f$ in $\mathscr{H}$ are in
one-to-one correspondence with the vectors

$$f = c_1 K_1 + c_2 K_2 = \mathbf{K}c, \quad c = [c_1\ \ c_2]^T.$$

If $\mathbf{K}$ is full rank, $\mathscr{H}$ covers the whole $\mathbb{R}^2$ and from (6.4) we have
$$\|f\|_{\mathscr{H}}^2 = c^T\mathbf{K}c = f^T\mathbf{K}^{-1}f.$$

For the sake of simplicity, assume also that $\mathbf{K}_{11} = \mathbf{K}_{22} = 1$ so that it must hold
$-1 \le \mathbf{K}_{12} \le 1$. Then, considering, e.g., the function $f(i) = i$, one has

$$\|f\|_{\mathscr{H}}^2 = [1\ \ 2]\,\mathbf{K}^{-1}[1\ \ 2]^T = \frac{5 - 4\mathbf{K}_{12}}{1 - \mathbf{K}_{12}^2}, \quad -1 < \mathbf{K}_{12} < 1.$$

Figure 6.1 displays $\|f\|_{\mathscr{H}}^2$ as a function of $\mathbf{K}_{12}$. One can see that the norm diverges
as $|\mathbf{K}_{12}|$ approaches 1.

If, e.g., $\mathbf{K}_{12} = 1$ the kernel function becomes constant over $\mathcal{X} \times \mathcal{X}$. Hence, the
two kernel sections $K_1(\cdot)$ and $K_2(\cdot)$ coincide, being constant with $K_1(i) = K_2(i) = 1$
for $i = 1, 2$. This means that $\mathbf{K}_{12} = 1$ induces a space $\mathscr{H}$ containing only constant
functions.¹ This explains why the norm (complexity) of $f$ becomes large if $\mathbf{K}_{12}$ is
close to 1: the space becomes less and less “tolerant” of functions with $f(1) \ne f(2)$.

Letting now $f(1) = 1$ and $f(2) = a$, the joint effect of $\mathbf{K}_{12}$ and $a$ is explained by
the formula

$$\|f\|_{\mathscr{H}}^2 = [1\ \ a]\,\mathbf{K}^{-1}[1\ \ a]^T = \frac{(a - \mathbf{K}_{12})^2}{1 - \mathbf{K}_{12}^2} + 1, \quad -1 < \mathbf{K}_{12} < 1.$$

Note that, thinking now of $\mathbf{K}_{12}$ as fixed, the function with minimal RKHS norm
(complexity) is obtained with $a = \mathbf{K}_{12}$ and has a norm equal to one.
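
The closed-form expressions of this example are easy to reproduce numerically. The sketch below is a minimal illustration (the helper name `rkhs_sq_norm` and the sampled values of the off-diagonal kernel entry are choices made here): it evaluates the squared norm as a quadratic form in the inverse kernel matrix and compares it with the two formulas given in the example.

```python
import numpy as np

def rkhs_sq_norm(f, K12):
    """Squared RKHS norm f^T K^{-1} f for the 2 x 2 kernel matrix [[1, K12], [K12, 1]]."""
    K = np.array([[1.0, K12], [K12, 1.0]])
    return float(f @ np.linalg.solve(K, f))

f = np.array([1.0, 2.0])                       # the function f(i) = i
for K12 in (-0.9, 0.0, 0.5, 0.99):
    closed_form = (5 - 4 * K12) / (1 - K12 ** 2)
    print(K12, rkhs_sq_norm(f, K12), closed_form)

# With f(1) = 1 and f(2) = a, the norm is minimal (and equal to one) for a = K12
K12 = 0.3
grid = np.linspace(-2.0, 2.0, 401)
norms = [rkhs_sq_norm(np.array([1.0, a]), K12) for a in grid]
a_best = grid[int(np.argmin(norms))]
print("minimizer a:", a_best, "minimal squared norm:", min(norms))
```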


¹ One can then also easily check that the case $\mathbf{K}_{12} = -1$ instead induces a RKHS containing only
functions satisfying $f(1) = -f(2)$.

Fig. 6.1 The figure plots the RKHS squared norm $\|f\|_{\mathscr{H}}^2$, with $f(i) = i$ and $i \in \{1, 2\}$, as a function of
the kernel value $K(1, 2)$, having set $K(1, 1) = K(2, 2) = 1$ (horizontal axis: $\mathbf{K}_{12}$)

Example 6.4 ($\mathscr{L}_2^\mu$ and $\ell_2$) Let $\mathcal{X} = \mathbb{R}$ and consider the classical Lebesgue space $\mathscr{L}_2^\mu$
of square summable functions with $\mu$ equal to the Lebesgue measure. Recall that this
is a Hilbert space whose elements are equivalence classes of functions measurable
w.r.t. Lebesgue: any group of functions which differ only on a set of null measure
(e.g., containing only a countable number of input locations) identifies the same
vector. Hence, $\mathscr{L}_2^\mu$ cannot be a RKHS since pointwise evaluation is not even well
defined.

Let instead $\mathcal{X} = \mathbb{N}$ (the set of natural numbers) and define the identity kernel

$$K(i, j) = \delta_{ij}, \quad (i, j) \in \mathbb{N} \times \mathbb{N}, \tag{6.5}$$

where $\delta_{ij}$ is the Kronecker delta. Clearly, $K$ is symmetric and positive definite according
to Definition 6.2 (it can be associated with an identity matrix of infinite size).
Hence, it induces a unique RKHS $\mathscr{H}$ that contains all the finite combinations of the
kernel sections. In particular, any finite sum can be written as $f(\cdot) = \sum_{i=1}^{m} f_i K_i(\cdot)$,
where some of the $f_i$ may be null, and corresponds to a sequence with a finite number
of non null components. To obtain the entire $\mathscr{H}$, we need also to add all the limits of
Cauchy sequences w.r.t. the norm (6.4) given by

$$\|f\|_{\mathscr{H}}^2 = \left\|\sum_{i=1}^{m} f_i K_i(\cdot)\right\|_{\mathscr{H}}^2 = \sum_{i=1}^{m}\sum_{j=1}^{m} f_i f_j K(i, j) = \sum_{i=1}^{m} f_i^2,$$

which coincides with the squared Euclidean norm of $[f_1 \ldots f_m]$. This allows us to
conclude that the associated RKHS is the classical space $\ell_2$ of square summable
sequences.

As a final note, Definition 6.1 easily confirms that $\ell_2$ is a RKHS. In fact, for
every $f = [f_1\ f_2\ \ldots] \in \ell_2$ one has
$$|f_i| \le \sqrt{\sum_{j=1}^{\infty} f_j^2} = \|f\|_2 \quad \forall i,$$

and, recalling (6.1), this shows that all the evaluation functionals $f \mapsto f_i$ with $i \in \mathbb{N}$
are bounded.


Example 6.5 (Sobolev space and the first-order spline kernel) While in the previous
example we have seen that $\mathscr{L}_2^\mu$ is not a RKHS, consider now the space obtained by
integrating the functions in this space. In particular, let $\mathcal{X} = [0, 1]$, set $\mu$ to the
Lebesgue measure and consider

$$\mathscr{H} = \left\{ f \ \Big|\ f(x) = \int_0^x h(a)\,da \ \text{ with } h \in \mathscr{L}_2^\mu \right\}.$$

One thus has that any $f$ in $\mathscr{H}$ satisfies $f(0) = 0$ and is absolutely continuous: its
derivative $h = \dot f$ is defined almost everywhere and is Lebesgue integrable.
With the inner product given by

$$\langle f, g\rangle_{\mathscr{H}} = \langle \dot f, \dot g\rangle_{\mathscr{L}_2^\mu},$$
it is easy to see that $\mathscr{H}$ is a Hilbert space. In fact, $\mathscr{L}_2^\mu$ is Hilbert and we have
established a one-to-one correspondence between functions in $\mathscr{H}$ and $\mathscr{L}_2^\mu$ which
preserves the inner product. Such $\mathscr{H}$ is an example of Sobolev space [2] since the
complexity of a function is measured by the energy of its derivative:

$$\|f\|_{\mathscr{H}}^2 = \int_0^1 \dot f^2(x)\,dx.$$

Now, given $x \in [0, 1]$, let $\chi_x(\cdot)$ be the indicator function of the set $[0, x]$. Then, one
has

$$|f(x)| = \left|\int_0^x \dot f(a)\,da\right| = \left|\langle \chi_x, \dot f\rangle_{\mathscr{L}_2^\mu}\right| \le \|\chi_x\|_{\mathscr{L}_2^\mu}\|\dot f\|_{\mathscr{L}_2^\mu} \le \|\dot f\|_{\mathscr{L}_2^\mu} = \|f\|_{\mathscr{H}},$$

where we have used the Cauchy–Schwarz inequality and $\|\chi_x\|_{\mathscr{L}_2^\mu} = \sqrt{x} \le 1$. Hence, $\mathscr{H}$ is also a RKHS
since all the evaluation functionals are bounded. We now prove that its reproducing
kernel is the so-called first-order (linear) spline kernel given by

$$K(x, y) = \min(x, y). \tag{6.6}$$


In fact, every kernel section belongs to $\mathscr{H}$, being piecewise linear with $\dot K_x = \chi_x$.
Furthermore, (6.6) satisfies the reproducing property since

$$\langle f, K_x\rangle_{\mathscr{H}} = \langle \dot f, \chi_x\rangle_{\mathscr{L}_2^\mu} = \int_0^x \dot f(y)\,dy = f(x).$$

The linear spline kernel and some of its sections are displayed in the top panels of
Fig. 6.2.

Fig. 6.2 Linear (top) and cubic (bottom) spline kernels with kernel sections $K_{x_i}(x)$ for $x_i = 0.1, 0.2, \ldots, 1$.
Panels: Linear Spline Kernel, Linear Spline Kernel Sections, Cubic Spline Kernel, Cubic Spline Kernel Sections
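
The properties of the first-order spline kernel discussed in Example 6.5 can be checked on a function built from a few kernel sections. In the sketch below, the centres, the coefficients and the evaluation point `x0` are arbitrary choices made here, and the integrals are approximated with a midpoint rule on a fine grid. For $f = \sum_i c_i K_{x_i}$ the derivative is the piecewise constant function $\sum_i c_i \chi_{x_i}$, so the norm (6.4) should match the energy of the derivative, and the inner product with $\chi_{x_0}$ should return $f(x_0)$.

```python
import numpy as np

centers = np.array([0.2, 0.5, 0.9])      # hypothetical input locations x_i
c = np.array([1.0, -2.0, 0.5])           # hypothetical coefficients c_i

# Norm from (6.4) with K(x, y) = min(x, y)
K = np.minimum(centers[:, None], centers[None, :])
norm_kernel = float(c @ K @ c)

# Midpoint grid on [0, 1] to approximate integrals
M = 20000
y = (np.arange(M) + 0.5) / M

# Derivative of f: sum_i c_i * indicator(y <= x_i)
fdot = (y[None, :] <= centers[:, None]).astype(float).T @ c
norm_integral = float(np.mean(fdot ** 2))        # approximates int_0^1 fdot^2

# Reproducing property at x0: <fdot, chi_x0> should equal f(x0)
x0 = 0.7
f_x0 = float(c @ np.minimum(centers, x0))        # f(x0) = sum_i c_i min(x_i, x0)
inner = float(np.mean(fdot * (y <= x0)))         # approximates int_0^x0 fdot

print(norm_kernel, norm_integral)
print(f_x0, inner)
```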


6.2.1 Reproducing Kernel Hilbert Spaces Induced
by Operations on Kernels ⋆


We report some classical results about RKHSs induced by operations on kernels
which can be derived from [13]. The first theorem characterizes the RKHS induced
by the sum or product of two kernels.

Theorem 6.6 (RKHS induced by sum or product of two kernels, based on [13]) Let
$K$ and $G$ be two positive definite kernels over the same domain $\mathcal{X} \times \mathcal{X}$, associated
to the RKHSs $\mathscr{H}$ and $\mathscr{G}$, respectively.

The sum $K + G$, where

$$[K + G](x, y) = K(x, y) + G(x, y),$$
is the reproducing kernel of the RKHS $\mathscr{R}$ containing functions
$$f = h + g, \quad (h, g) \in \mathscr{H} \times \mathscr{G}$$

with
$$\|f\|_{\mathscr{R}}^2 = \min_{h\in\mathscr{H},\, g\in\mathscr{G}} \|h\|_{\mathscr{H}}^2 + \|g\|_{\mathscr{G}}^2 \quad \text{s.t.} \quad f = h + g.$$
The product $KG$, where
$$[KG](x, y) = K(x, y)\,G(x, y)$$
is instead the reproducing kernel of the RKHS $\mathscr{R}$ containing functions

$$f = hg, \quad (h, g) \in \mathscr{H} \times \mathscr{G}$$

with
$$\|f\|_{\mathscr{R}}^2 = \min_{h\in\mathscr{H},\, g\in\mathscr{G}} \|h\|_{\mathscr{H}}^2\,\|g\|_{\mathscr{G}}^2 \quad \text{s.t.} \quad f = hg.$$


The second theorem instead provides the connection between two RKHSs, with 
the second one obtained from the first one by sampling its kernel. 


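Theorem 6.7 (RKHS induced by kernel sampling, based on [13]) Let $\mathscr{H}$ be the
RKHS induced by the kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $\mathcal{Y} \subset \mathcal{X}$ and denote with $\mathscr{R}$
the RKHS of functions over $\mathcal{Y}$ induced by the restriction of the kernel $K$ on $\mathcal{Y} \times \mathcal{Y}$.
Then, the functions in $\mathscr{R}$ correspond to the functions in $\mathscr{H}$ sampled on $\mathcal{Y}$. One also
has

$$\|f\|_{\mathscr{R}} = \min_{g\in\mathscr{H}} \|g\|_{\mathscr{H}} \quad \text{s.t.} \quad g_{\mathcal{Y}} = f, \tag{6.7}$$

where $g_{\mathcal{Y}}$ is $g$ sampled on $\mathcal{Y}$.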
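
When the input space is finite and the kernel matrix is full rank, the squared norm is the quadratic form in the inverse kernel matrix seen in Example 6.3, and (6.7) can be verified directly. The sketch below is an illustration under these assumptions (the kernel matrix, the subset `Y` and the function `f` are randomly generated here): it compares the squared norm in the restricted space with that of the minimum-norm extension, and checks that another extension with the same values on the subset has a larger norm.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5
A = rng.standard_normal((m, m))
K = A @ A.T + 0.1 * np.eye(m)        # full-rank kernel matrix on X = {0, ..., 4}

Y = [0, 2, 3]                        # hypothetical subset of input locations
Z = [1, 4]                           # remaining locations
f = rng.standard_normal(len(Y))      # a function defined on Y

# Squared norm in the RKHS induced by the kernel restricted to Y x Y
K_YY = K[np.ix_(Y, Y)]
coef = np.linalg.solve(K_YY, f)
norm_R = float(f @ coef)

# Extension to the whole X built from the kernel sections centred on Y
g = K[:, Y] @ coef
norm_H = float(g @ np.linalg.solve(K, g))

# Any other extension agreeing with f on Y has a larger norm
g_other = g.copy()
g_other[Z] += rng.standard_normal(len(Z))
norm_other = float(g_other @ np.linalg.solve(K, g_other))

print(norm_R, norm_H, norm_other)
```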


The following theorem lists some operations which permit to build kernels (and 
hence RKHSs) from simple building blocks. 


Theorem 6.8 (Building kernels from kernels, based on [13]) Let $K_1$ and $K_2$ be two positive definite kernels over $\mathcal{X} \times \mathcal{X}$ and $K_3$ a positive definite kernel over $\mathbb{R}^m \times \mathbb{R}^m$. Let also $P$ be an $m \times m$ symmetric positive semidefinite matrix and $\mathcal{P}(x)$ a polynomial with positive coefficients. Then, the following functions are positive definite kernels over $\mathcal{X} \times \mathcal{X}$:

• $K(x, y) = K_1(x, y) + K_2(x, y)$ (see also Theorem 6.6).
• $K(x, y) = a K_1(x, y)$, $a \geq 0$.
• $K(x, y) = K_1(x, y) K_2(x, y)$ (see also Theorem 6.6).
• $K(x, y) = f(x) f(y)$, $f : \mathcal{X} \to \mathbb{R}$.
• $K(x, y) = K_3(f(x), f(y))$, $f : \mathcal{X} \to \mathbb{R}^m$.
• $K(x, y) = x^T P y$, $\mathcal{X} = \mathbb{R}^m$.
• $K(x, y) = \mathcal{P}(K_1(x, y))$.
• $K(x, y) = \exp(K_1(x, y))$.
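Some of the rules of Theorem 6.8 can be checked numerically on a finite set of points, where positive definiteness of a kernel translates into positive semidefiniteness of its kernel matrix. The sketch below is only an illustration under assumptions made here: the Gaussian and linear kernels, the sample points and the helper names `gram` and `min_eigenvalue` are not taken from the text.

```python
import numpy as np

def gaussian_kernel(x, y, width=1.0):
    """Gaussian kernel, used here only as an example building block K_1."""
    return np.exp(-(x - y) ** 2 / (2.0 * width ** 2))

def linear_kernel(x, y):
    """Linear kernel x*y on the real line (rule K = x^T P y with P = 1)."""
    return x * y

def gram(kernel, points):
    """Kernel matrix with (i, j) entry kernel(points[i], points[j])."""
    return kernel(points[:, None], points[None, :])

def min_eigenvalue(matrix):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(matrix).min()

rng = np.random.default_rng(0)
points = rng.uniform(-2.0, 2.0, size=30)
K1 = gram(gaussian_kernel, points)
K2 = gram(linear_kernel, points)

# Kernel matrices of the combined kernels; products, powers and the
# exponential act entrywise because they act on the kernel values.
combined = {
    "sum": K1 + K2,
    "scaling": 3.0 * K1,
    "product": K1 * K2,
    "polynomial": 1.0 + 2.0 * K2 + 0.5 * K2 ** 2,
    "exponential": np.exp(K2),
}
min_eigs = {name: min_eigenvalue(M) for name, M in combined.items()}
```

Up to rounding errors, all the entries of `min_eigs` are nonnegative. Such a check on one point set is of course only a sanity test and not a proof of positive definiteness.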


6.3 Spectral Representations of Reproducing Kernel 
Hilbert Spaces 


In the previous section we have seen that any RKHS is generated by its kernel 
sections. We now discuss another representation obtainable when the kernel can be 
diagonalized as follows 


\[
K(x, y) = \sum_{i \in \mathcal{I}} \zeta_i \rho_i(x) \rho_i(y), \quad \zeta_i > 0 \ \forall i, \tag{6.8}
\]
where the set $\mathcal{I}$ is countable. This will lead to new insights on the nature of the RKHSs, generalizing to the infinite-dimensional case the connection between regularization and basis expansion reported in Sect. 5.6.

A simple situation holds when the input space has finite cardinality, e.g., $\mathcal{X} = \{x_1, \ldots, x_m\}$. Under this assumption, any positive definite kernel is in one-to-one correspondence with the $m \times m$ matrix $\mathbf{K}$ whose $(i, j)$-entry is $K(x_i, x_j)$. The representation (6.8) then follows from the spectral theorem applied to $\mathbf{K}$. In fact, if $\zeta_i$ and $v_i$ are, respectively, the eigenvalues and the orthonormal (column) eigenvectors of $\mathbf{K}$, (6.8) can be written as
\[
\mathbf{K} = \sum_{i=1}^{m} \zeta_i v_i v_i^T,
\]
where the functions $\rho_i(\cdot)$ have become the vectors $v_i$. One generalization of this result is described below.
Let $L_K$ be the linear operator defined by the positive definite kernel $K$ as follows:
\[
L_K[f](\cdot) = \int_{\mathcal{X}} K(\cdot, x) f(x) \, d\mu(x). \tag{6.9}
\]
We also assume that $\mu$ is a $\sigma$-finite and nondegenerate Borel measure on $\mathcal{X}$. Essentially this means that $\mathcal{X}$ is the countable union of measurable sets with finite measure and that $\mu$ "covers" entirely $\mathcal{X}$. The reader can, e.g., consider $\mathcal{X} \subset \mathbb{R}^m$ and think of $\mu$ as the Lebesgue measure or any probability measure with $\mu(A) > 0$ for any nonempty open set $A \subset \mathcal{X}$. The next classical result goes under the name of Mercer theorem, whose formulations trace back to [60].


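Theorem 6.9 (Mercer theorem, based on [60]) Let $\mathcal{X}$ be a compact metric space equipped with a nondegenerate and $\sigma$-finite Borel measure $\mu$ and let $K$ be a Mercer kernel on $\mathcal{X} \times \mathcal{X}$. Then, there exists a complete orthonormal basis of $\mathcal{L}_2^{\mu}$ given by a countable number of continuous functions $\{\rho_i\}_{i \in \mathcal{I}}$ satisfying
\[
L_K[\rho_i] = \zeta_i \rho_i, \quad i \in \mathcal{I}, \quad \zeta_1 \geq \zeta_2 \geq \cdots \geq 0, \tag{6.10}
\]
with $\zeta_i > 0 \ \forall i$ if $K$ is strictly positive and $\lim_{i \to \infty} \zeta_i = 0$ if the number of eigenvalues is infinite.
One also has
\[
K(x, y) = \sum_{i \in \mathcal{I}} \zeta_i \rho_i(x) \rho_i(y), \tag{6.11}
\]
where the convergence of the series is absolute and uniform on $\mathcal{X} \times \mathcal{X}$.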


The following result characterizes a RKHS through the eigenfunctions of $L_K$. The proof is reported in Sect. 6.9.3.


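Theorem 6.10 (RKHS defined by an orthonormal basis of $\mathcal{L}_2^{\mu}$) Under the same assumptions of Theorem 6.9, if the $\rho_i$ and $\zeta_i$ satisfy (6.10), with also $\zeta_i > 0 \ \forall i$, one has
\[
\mathcal{H} = \Big\{ f \ \Big| \ f = \sum_{i \in \mathcal{I}} c_i \rho_i \quad \text{s.t.} \quad \sum_{i \in \mathcal{I}} \frac{c_i^2}{\zeta_i} < \infty \Big\}. \tag{6.12}
\]
In addition, if
\[
f = \sum_{i \in \mathcal{I}} a_i \rho_i, \quad g = \sum_{i \in \mathcal{I}} b_i \rho_i,
\]
one has
\[
\langle f, g \rangle_{\mathcal{H}} = \sum_{i \in \mathcal{I}} \frac{a_i b_i}{\zeta_i} \tag{6.13}
\]
so that
\[
\|f\|_{\mathcal{H}}^2 = \sum_{i \in \mathcal{I}} \frac{a_i^2}{\zeta_i}. \tag{6.14}
\]
Hence, it also follows that $\{\sqrt{\zeta_i} \rho_i\}_{i \in \mathcal{I}}$ is an orthonormal basis of $\mathcal{H}$.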


The representation (6.12) is not unique since the spectral maps, i.e., the functions that associate a kernel with a decomposition of the type (6.8), are not unique. They depend on the chosen measure $\mu$ even if they lead to the same RKHS.

Theorem 6.10 thus shows that any kernel admitting an expansion (6.11) coming from the Mercer theorem induces a separable RKHS, i.e., having a countable basis given by the $\rho_i$. Later on, Theorem 6.13 will show that this result holds under much milder assumptions. In fact, the representation (6.12) can be obtained starting from any diagonalized kernel (6.8) involving generic functions $\rho_i$, e.g., not necessarily independent of each other. One can also remove the compactness hypothesis on the input space, e.g., letting $\mathcal{X}$ be the entire $\mathbb{R}^m$.


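Remark 6.2 (Relationship between $\mathcal{H}$ and $\mathcal{L}_2^{\mu}$) Theorem 6.10 points out an interesting connection between $\mathcal{H}$ and $\mathcal{L}_2^{\mu}$. Since the functions $\rho_i$ form an orthonormal basis in $\mathcal{L}_2^{\mu}$, one has
\[
f \in \mathcal{L}_2^{\mu} \iff f = \sum_{i \in \mathcal{I}} c_i \rho_i \quad \text{with} \quad \sum_{i \in \mathcal{I}} c_i^2 < \infty \tag{6.15}
\]
while (6.12) shows that
\[
f \in \mathcal{H} \iff f = \sum_{i \in \mathcal{I}} c_i \rho_i \quad \text{with} \quad \sum_{i \in \mathcal{I}} \frac{c_i^2}{\zeta_i} < \infty. \tag{6.16}
\]
If $\zeta_i > 0 \ \forall i$, one has the set inclusion $\mathcal{H} \subset \mathcal{L}_2^{\mu}$ since the functions in the RKHS must satisfy a more stringent condition on the decay of the expansion coefficients (the $\zeta_i$ decay to zero).

In addition, let $L_K^{1/2}$ denote the operator defined as the square root of $L_K$, i.e., for any $f \in \mathcal{L}_2^{\mu}$ with $f = \sum_{i \in \mathcal{I}} c_i \rho_i$, one has
\[
L_K^{1/2}[f] = \sum_{i \in \mathcal{I}} \sqrt{\zeta_i} \, c_i \rho_i. \tag{6.17}
\]
This is a smoothing operator: the function $L_K^{1/2}[f]$ is more regular than $f$ since the expansion coefficients $\sqrt{\zeta_i} c_i$ decrease to zero faster than the $c_i$. In view of (6.15) and (6.16), we obtain
\[
\mathcal{H} = \big\{ L_K^{1/2}[f] \ \big| \ f \in \mathcal{L}_2^{\mu} \big\}, \tag{6.18}
\]
which shows that the RKHS can be thought of as the output of the linear system $L_K^{1/2}$ fed with the space $\mathcal{L}_2^{\mu}$, i.e., $\mathcal{H} = L_K^{1/2} \mathcal{L}_2^{\mu}$.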


Example 6.11 (Spline kernel expansion) In Example 6.5, we have seen that the space of functions on the unit interval satisfying $f(0) = 0$ and $\int_0^1 \dot{f}^2(x) dx < \infty$ is the RKHS associated to the first-order spline kernel $\min(x, y)$. We now derive a representation of the type (6.12) for this space, setting $\mu$ to the Lebesgue measure. For this purpose, consider the system
\[
\int_0^1 \min(x, y) \rho(y) dy = \zeta \rho(x).
\]
The above equation is equivalent to
\[
\int_0^x y \rho(y) dy + x \int_x^1 \rho(y) dy = \zeta \rho(x),
\]
which implies $\rho(0) = 0$. Taking the derivative w.r.t. $x$ we also obtain
\[
\int_x^1 \rho(y) dy = \zeta \dot{\rho}(x),
\]
which implies $\dot{\rho}(1) = 0$. Differentiating again w.r.t. $x$ gives
\[
-\rho(x) = \zeta \ddot{\rho}(x),
\]
whose general solution is
\[
\rho(x) = a \sin(x / \sqrt{\zeta}) + b \cos(x / \sqrt{\zeta}), \quad a, b \in \mathbb{R}.
\]
The boundary conditions $\rho(0) = \dot{\rho}(1) = 0$ imply $b = 0$ and lead to the following possible eigenvalues:
\[
\zeta_i = \frac{1}{(i\pi - \pi/2)^2}, \quad i = 1, 2, \ldots
\]
The orthonormality condition also implies $a = \sqrt{2}$ so that we obtain
\[
\rho_i(x) = \sqrt{2} \sin\Big(i \pi x - \frac{\pi x}{2}\Big), \quad i = 1, 2, \ldots
\]
This provides the formulation (6.12) of the Sobolev space $\mathcal{H}$. Figure 6.3 plots three eigenfunctions (left panel) and the first 100 eigenvalues $\zeta_i$ (right panel). It is evident that the larger $i$, the larger the high-frequency content of $\rho_i$ and the RKHS norm of such basis function. In fact, a large value of $i$ corresponds to a small eigenvalue $\zeta_i$ and one has $\|\rho_i\|_{\mathcal{H}}^2 = 1/\zeta_i$.
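The closed-form eigenvalues and eigenfunctions of the spline kernel can be compared with a numerical approximation. The following sketch replaces the integral operator with a midpoint quadrature rule, so that its eigenpairs are approximated by those of a scaled kernel matrix. The grid size `n` and all variable names are choices made here for illustration.

```python
import numpy as np

n = 400                                   # number of quadrature nodes (assumed)
grid = (np.arange(n) + 0.5) / n           # midpoints of a uniform partition of [0, 1]
K = np.minimum(grid[:, None], grid[None, :])

# Midpoint-rule discretization of the integral operator with kernel min(x, y).
eigvals, eigvecs = np.linalg.eigh(K / n)
eigvals = eigvals[::-1]                   # sort in decreasing order
eigvecs = eigvecs[:, ::-1]

i = np.arange(1, 6)
zeta_exact = 1.0 / ((i - 0.5) * np.pi) ** 2
zeta_numeric = eigvals[:5]

# First eigenfunction, rescaled to unit norm in L2 and with positive sign.
rho1_numeric = eigvecs[:, 0] * np.sqrt(n)
rho1_numeric *= np.sign(rho1_numeric[-1])
rho1_exact = np.sqrt(2.0) * np.sin(0.5 * np.pi * grid)
```

With this grid the first five numerical eigenvalues agree with $1/(i\pi - \pi/2)^2$ to a relative accuracy better than $10^{-3}$, and the first eigenvector reproduces $\sqrt{2}\sin(\pi x/2)$ at the grid points.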


Example 6.12 (Translation invariant kernels and Fourier expansion) A translation invariant kernel depends only on the difference of its two arguments. Hence, there exists $h : \mathcal{X} \to \mathbb{R}$ such that $K(x, y) = h(x - y)$. Assume that $\mathcal{X} = [0, 2\pi]$ and that $h$ can be extended to a continuous, symmetric and periodic function over $\mathbb{R}$. Then, it can be expanded in terms of the following uniformly convergent Fourier series
\[
h(x) = \sum_{i=0}^{\infty} \zeta_i \cos(ix),
\]
Fig. 6.3 Expansion of the first-order spline kernel $\min(x, y)$: eigenfunctions $\rho_i$ for $i = 1, 2, 8$ as functions of $x$ (left panel) and eigenvalues $\zeta_i$ as a function of the eigenvalue index (right panel)


where $\zeta_0$ accounts for the constant component and we assume $\zeta_i > 0 \ \forall i$. We thus obtain the kernel expansion
\[
K(x, y) = \zeta_0 + \sum_{i=1}^{\infty} \zeta_i \cos(ix) \cos(iy) + \sum_{i=1}^{\infty} \zeta_i \sin(ix) \sin(iy),
\]
in terms of functions which are all orthogonal in $\mathcal{L}_2^{\mu}$. Hence, these kernels induce RKHSs generated by the Fourier basis, with different inner products determined by the $\zeta_i$.


6.3.1 More General Spectral Representation *


Now, assume that the kernel $K$ is available in the form $K(x, y) = \sum_{i \in \mathcal{I}} \zeta_i \rho_i(x) \rho_i(y)$ with $\zeta_i > 0 \ \forall i$, but with functions $\rho_i$ not necessarily orthonormal. More generally, we do not even require that they are independent, e.g., $\rho_1$ could be a linear combination of $\rho_2$ and $\rho_3$. The following result shows that the RKHS associated to $K$ is still generated by the $\rho_i$, but the relationship of the expansion coefficients with $\| \cdot \|_{\mathcal{H}}$ is more involved than in the previous case.


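Theorem 6.13 (RKHS induced by a diagonalized kernel) Let $\mathcal{H}$ be the RKHS induced by $K(x, y) = \sum_{i \in \mathcal{I}} \zeta_i \rho_i(x) \rho_i(y)$ with $\zeta_i > 0 \ \forall i$ and the set $\mathcal{I}$ countable. Then, $\mathcal{H}$ is separable and admits the representation
\[
\mathcal{H} = \Big\{ f \ \Big| \ f(x) = \sum_{i \in \mathcal{I}} c_i \rho_i(x) \quad \text{s.t.} \quad \sum_{i \in \mathcal{I}} \frac{c_i^2}{\zeta_i} < \infty \Big\} \tag{6.19}
\]
and one has
\[
\|f\|_{\mathcal{H}}^2 = \min_{\{c_i\}} \ \sum_{i \in \mathcal{I}} \frac{c_i^2}{\zeta_i} \quad \text{s.t.} \quad f = \sum_{i \in \mathcal{I}} c_i \rho_i. \tag{6.20}
\]
The proof is reported in Sect. 6.9.4 while an application example is given below.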


Example 6.14 Let
\[
K(x, y) = 2 \sin^2(x) \sin^2(y) + 2 \cos^2(x) \cos^2(y) + 1.
\]
Using Theorem 6.13, we obtain that the RKHS $\mathcal{H}$ associated to $K$ is spanned by $\sin^2(x)$, $\cos^2(x)$ and the constant function. Now, let $f(x) = 1$ and consider the problem of computing $\|f\|_{\mathcal{H}}^2$. To have a correspondence with (6.8) we can, e.g., fix the notation
\[
\rho_1(x) = \sin^2(x), \quad \rho_2(x) = \cos^2(x), \quad \rho_3(x) = 1
\]
and
\[
\zeta_1 = 2, \quad \zeta_2 = 2, \quad \zeta_3 = 1.
\]
Since the functions $\rho_i$ are not independent, many different representations of $f(x) = 1$ can be found. In particular, one has
\[
1 = c \rho_1(x) + c \rho_2(x) + (1 - c) \rho_3(x) \quad \forall c \in \mathbb{R},
\]
so that
\[
\|f\|_{\mathcal{H}}^2 = \min_{c} \ \frac{c^2}{2} + \frac{c^2}{2} + (1 - c)^2 = \min_{c} \ 2c^2 - 2c + 1 = \frac{1}{2},
\]
with the minimum $1/2$ obtained at $c = 1/2$. Hence, according to the norm of $\mathcal{H}$, the "minimum energy" representation of $f(x) = 1$ is $\frac{1}{2}(\rho_1(x) + \rho_2(x) + \rho_3(x))$.
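The minimization in (6.20) can also be carried out numerically. The sketch below, which is only an illustration, samples the three functions of Example 6.14 on a few arbitrary points and solves the resulting equality-constrained weighted least squares problem. The sample points, the use of a pseudoinverse and the variable names are assumptions made here.

```python
import numpy as np

zeta = np.array([2.0, 2.0, 1.0])

def basis(x):
    """Rows: sample points; columns: rho_1 = sin^2, rho_2 = cos^2, rho_3 = 1."""
    return np.column_stack([np.sin(x) ** 2, np.cos(x) ** 2, np.ones_like(x)])

x = np.linspace(0.0, 3.0, 7)          # arbitrary sample points (assumed)
R = basis(x)
f = np.ones_like(x)                   # samples of f(x) = 1

# Minimize sum_i c_i^2 / zeta_i subject to R c = f.
# The Lagrangian conditions give c = Z R^T lam with (R Z R^T) lam = f,
# where Z = diag(zeta) and R Z R^T is the kernel matrix at the sample points.
gram = R @ np.diag(zeta) @ R.T
lam = np.linalg.pinv(gram, rcond=1e-10) @ f   # gram is rank deficient
c_opt = zeta * (R.T @ lam)
norm_sq = np.sum(c_opt ** 2 / zeta)
```

Here `c_opt` is numerically equal to $(1/2, 1/2, 1/2)$ and `norm_sq` to $1/2$, in agreement with the example. The pseudoinverse is needed because the three functions span a space of dimension two only.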


6.4 Kernel-Based Regularized Estimation 


6.4.1 Regularization in Reproducing Kernel Hilbert Spaces 
and the Representer Theorem 


A powerful approach to reconstruct a function $g : \mathcal{X} \to \mathbb{R}$ from sparse data $\{x_i, y_i\}_{i=1}^{N}$ consists of minimizing a suitable functional over a RKHS. An important generalization of the estimators based on quadratic penalties, denoted by ReLS-Q in Chap. 3, is defined by
\[
\hat{g} = \arg\min_{f \in \mathcal{H}} \ \sum_{i=1}^{N} \mathcal{V}_i(y_i, f(x_i)) + \gamma \|f\|_{\mathcal{H}}^2. \tag{6.21}
\]
In (6.21), the $\mathcal{V}_i$ are loss functions measuring the distance between $y_i$ and $f(x_i)$. They take only nonnegative values and are assumed convex w.r.t. their second argument $f(x_i)$. As an example, when the quadratic loss is adopted for any $i$, one obtains
\[
\mathcal{V}_i(y_i, f(x_i)) = (y_i - f(x_i))^2.
\]
Then, the norm $\| \cdot \|_{\mathcal{H}}$ defines the regularizer, e.g., given by the energy of the first-order derivative
\[
\|f\|_{\mathcal{H}}^2 = \int_0^1 \dot{f}^2(x) dx,
\]
which corresponds to the spline norm introduced in Example 6.5. Finally, the positive scalar $\gamma$ is the regularization parameter (already encountered in the previous chapters) which has to balance adherence to experimental data and function regularity. Indeed, the idea underlying (6.21) is that the predictor $\hat{g}$ should be able to describe the data without being too complex according to the RKHS norm. In particular, the purpose of the regularizer is to restore the well-posedness of the problem, making the solution depend continuously on the data. It should also include our available information on the unknown function, e.g., the expected smoothness level.

The importance of the RKHSs in the context of regularization methods stems from the following central result, whose first formulation can be found in [52]. It shows that the solutions of the class of variational problems (6.21) admit a finite-dimensional representation, independently of the dimension of $\mathcal{H}$. The proof of an extended version of this result can be found in Sect. 6.9.5.


Theorem 6.15 (Representer theorem, adapted from [104]) Let $\mathcal{H}$ be a RKHS. Then, all the solutions of (6.21) admit the following expression
\[
\hat{g} = \sum_{i=1}^{N} c_i K_{x_i}, \tag{6.22}
\]
where the $c_i$ are suitable scalar expansion coefficients.


Thus, as in the traditional linear parametric approach, the optimal function is a linear combination of basis functions. However, a fundamental difference is that their number is now equal to the number of data pairs, and is thus not fixed a priori. In fact, the functions appearing in the expression of the minimizer $\hat{g}$ are just the kernel sections $K_{x_i}$ centred on the input data. The representer theorem also conveys the message that, using estimators of the form (6.21), it is not possible to recover arbitrarily complex functions from a finite amount of data. The solution is always confined to a subspace with dimension equal to the data set size.
Now, let $\mathbf{K} \in \mathbb{R}^{N \times N}$ be the positive semidefinite matrix (called kernel matrix, or Gram matrix) such that $\mathbf{K}_{ij} = K(x_i, x_j)$. The $i$th row of $\mathbf{K}$ is denoted by $k_i$. Using this notation, if $g = \sum_{i=1}^{N} c_i K_{x_i}$, then
\[
g(x_i) = k_i c \quad \text{and} \quad \|g\|_{\mathcal{H}}^2 = c^T \mathbf{K} c, \tag{6.23}
\]
where $c = [c_1, \ldots, c_N]^T$ and the second equality derives from the reproducing property or, equivalently, from (6.4).

Using the representer theorem, we can plug the expression (6.22) of the optimal $\hat{g}$ into the objective (6.21). Then, exploiting (6.23), the variational problem (6.21) boils down to
\[
\min_{c \in \mathbb{R}^N} \ \sum_{i=1}^{N} \mathcal{V}_i(y_i, k_i c) + \gamma c^T \mathbf{K} c. \tag{6.24}
\]
The regularization problem (6.21) has thus been reduced to a finite-dimensional optimization problem whose order $N$ does not depend on the dimension of the original space $\mathcal{H}$. In addition, since each loss function $\mathcal{V}_i$ has been assumed convex, the objective (6.24) is convex overall. How to compute the expansion coefficients now depends on the specific choice of the $\mathcal{V}_i$, as discussed in the next section.


Remark 6.3 (Kernel trick and implicit basis functions encoding) Assume that the kernel admits the expansion $K(x, y) = \sum_{i=1}^{\infty} \zeta_i \rho_i(x) \rho_i(y)$, $\zeta_i > 0$. Then, as discussed in Sect. 6.3, any function in $\mathcal{H}$ has the representation
\[
f = \sum_{i=1}^{\infty} a_i \rho_i \quad \text{with} \quad \|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{\infty} \frac{a_i^2}{\zeta_i}.
\]
Problem (6.21) can then be rewritten using the infinite-dimensional vector $a = [a_1 \ a_2 \ \ldots]$ as unknown:
\[
\hat{a} = \arg\min_{a} \ \sum_{i=1}^{N} \mathcal{V}_i\Big(y_i, \sum_{j=1}^{\infty} a_j \rho_j(x_i)\Big) + \gamma \sum_{j=1}^{\infty} \frac{a_j^2}{\zeta_j}
\]
and an equivalent representation of (6.22) becomes $\hat{g} = \sum_{i=1}^{\infty} \hat{a}_i \rho_i$. In comparison to this reformulation, the use of the kernel and of the representer theorem brings modelling and computational advantages. In fact, through $K$ one needs neither to choose the number of basis functions to be used (the kernel can already include in an implicit way an infinite number of basis functions) nor to store any basis function in memory (the representer theorem reduces inference to solving a finite-dimensional optimization problem based on the kernel matrix $\mathbf{K}$). These features are related to what is called the kernel trick in the machine learning literature.
6.4.2 Representer Theorem Using Linear and Bounded Functionals


A more general version of the representer theorem obtained in [52] can be derived by replacing $f(x_i)$ with $L_i[f]$, where $L_i$ is linear and bounded. In the first part of the following result $\mathcal{H}$ is just required to be a Hilbert space. In Sect. 6.9.5 we will see how Theorem 6.16 can be further generalized.


Theorem 6.16 (Representer theorem with functionals $L_i$, adapted from [104]) Let $\mathcal{H}$ be a Hilbert space and consider the optimization problem
\[
\hat{g} = \arg\min_{f \in \mathcal{H}} \ \sum_{i=1}^{N} \mathcal{V}_i(y_i, L_i[f]) + \gamma \|f\|_{\mathcal{H}}^2, \tag{6.25}
\]
where each $L_i : \mathcal{H} \to \mathbb{R}$ is linear and bounded. Then, all the solutions of (6.25) admit the following expression
\[
\hat{g} = \sum_{i=1}^{N} c_i \eta_i, \tag{6.26}
\]
where the $c_i$ are suitable scalar expansion coefficients and each $\eta_i \in \mathcal{H}$ is the representer of $L_i$, i.e., for any $i$ and $f \in \mathcal{H}$:
\[
L_i[f] = \langle f, \eta_i \rangle_{\mathcal{H}}. \tag{6.27}
\]
In particular, if $\mathcal{H}$ is a RKHS with kernel $K$, each basis function is given by
\[
\eta_i(x) = L_i[K(\cdot, x)]. \tag{6.28}
\]


The existence of $\eta_i$ satisfying (6.27) is ensured by the Riesz representation theorem (Theorem 6.27). One can also prove that in a RKHS a linear functional $L$ is bounded if and only if the function $f$ obtained by applying $L$ to the kernel, i.e., $f(x) = L[K(x, \cdot)] \ \forall x$, belongs to the RKHS.

Note also that Theorem 6.15 is indeed a special case of the last result. In fact, let $\mathcal{H}$ be a RKHS and $L_i[f] = f(x_i) \ \forall i$. Then, each $L_i$ is linear and bounded and each $\eta_i$ becomes the kernel section $K_{x_i}$ according to the reproducing property.


Example 6.17 (Solution using the quadratic loss) Let us adopt a quadratic loss in (6.25), i.e., $\mathcal{V}_i(y_i, L_i[f]) = (y_i - L_i[f])^2$. This makes the objective strictly convex so that a unique solution exists. To find it, plugging (6.26) in (6.25) and using also (6.28), the following quadratic problem is obtained
\[
\|Y - Oc\|^2 + \gamma c^T O c, \tag{6.29}
\]
where $Y = [y_1, \ldots, y_N]^T$, $\| \cdot \|$ is the Euclidean norm, while the $N \times N$ matrix $O$ has $(i, j)$ entry given by
\[
O_{ij} = \langle \eta_i, \eta_j \rangle_{\mathcal{H}} = L_i[L_j[K]]. \tag{6.30}
\]
The minimizer $\hat{c}$ of (6.29) is unique if $O$ is full rank. Otherwise, all the solutions lead to the same function estimate in view of the (already mentioned) strict convexity of (6.25). In particular, one can always use as optimal expansion coefficients the components of the vector
\[
\hat{c} = (O + \gamma I_N)^{-1} Y. \tag{6.31}
\]
In Sect. 6.5.1 this result will be further discussed in the context of the so-called regularization networks, where one comes back to assume $L_i[f] = f(x_i)$.


6.5 Regularization Networks and Support Vector Machines 


The choice of the loss $\mathcal{V}_i$ in (6.21) yields regularization algorithms with different properties. We will illustrate four different cases below.


6.5.1 Regularization Networks 


Let us consider the quadratic loss function $\mathcal{V}_i(y_i, f(x_i)) = r_i^2$ with the residual $r_i$ defined by $r_i = y_i - f(x_i)$. Such a loss, also depicted in Fig. 6.4 (top left panel), leads to the problem
\[
\hat{g} = \arg\min_{f \in \mathcal{H}} \ \sum_{i=1}^{N} (y_i - f(x_i))^2 + \gamma \|f\|_{\mathcal{H}}^2, \tag{6.32}
\]
which is a generalization of the regularized least squares problem encountered in the previous chapters. In particular, it extends the estimator (3.58a) based on quadratic penalty called ReLS-Q in Chap. 3. The estimator (6.32) is known in the literature as regularization network [71] or also kernel ridge regression. The strict convexity of the objective (6.32) ensures that the minimizer $\hat{g}$ not only exists but is also unique (this issue is further discussed in the remark at the end of this subsection).

To find the solution, we can follow the same arguments developed in Example 6.17, just specializing the result to the case $L_i[f] = f(x_i)$. We will see that the matrix $O$ has just to be replaced by the kernel matrix $\mathbf{K}$.

As previously done, let $Y = [y_1, \ldots, y_N]^T$ and use $\| \cdot \|$ to indicate the Euclidean norm. Then, the corresponding regularization problem (6.24) becomes
Fig. 6.4 Loss functions examples: quadratic (top left), Huber with $\delta = 1$ (top right), Vapnik ($\epsilon$-insensitive) with $\epsilon = 0.5$ (bottom left) and Hinge (bottom right). The first three losses are all functions of the residual $r = y - f(x)$ while the hinge loss depends on the margin $m = y f(x)$


\[
\min_{c \in \mathbb{R}^N} \ \|Y - \mathbf{K} c\|^2 + \gamma c^T \mathbf{K} c, \tag{6.33}
\]
which is a finite-dimensional ReLS-Q. After simple calculations, one of the optimal solutions² is found to be
\[
\hat{c} = (\mathbf{K} + \gamma I_N)^{-1} Y, \tag{6.34}
\]
where $I_N$ is the $N \times N$ identity matrix. The estimate from the regularization network is thus available in closed form, given by $\hat{g} = \sum_{i=1}^{N} \hat{c}_i K_{x_i}$, with the optimal coefficient vector $\hat{c}$ solving a linear system of equations.
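The closed form (6.34) translates directly into a few lines of code. The following is a minimal sketch, assuming a Gaussian kernel, synthetic data and a fixed value of $\gamma$, all of which are choices made here for illustration; the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.2):
    """Gaussian kernel matrix between 1-D arrays a and b (an assumed choice of K)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * width ** 2))

def fit_regularization_network(x, y, gamma, kernel):
    """Return c_hat = (K + gamma I_N)^{-1} Y as in (6.34)."""
    K = kernel(x, x)
    return np.linalg.solve(K + gamma * np.eye(len(x)), y)

def predict(x_new, x, c_hat, kernel):
    """Evaluate g_hat(x_new) = sum_i c_hat_i K(x_i, x_new)."""
    return kernel(x_new, x) @ c_hat

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, size=40))
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(40)
gamma = 0.1

c_hat = fit_regularization_network(x, y, gamma, gaussian_kernel)
K = gaussian_kernel(x, x)

def objective(c):
    """Objective (6.33) evaluated at the coefficient vector c."""
    return np.sum((y - K @ c) ** 2) + gamma * c @ K @ c
```

The linear system is solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is the usual choice for numerical reasons. Since the objective is convex, `objective(c_hat)` is not larger than the value at any other coefficient vector.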


² Similarly to what was discussed in Example 6.17, if $\mathbf{K}$ is not full rank, the solution of (6.33) is not unique. In fact, the minimizers are given by the sum of (6.34) and any vector in the null space of the kernel matrix. However, all of them lead to the same function estimate $\hat{g}$.


Remark 6.4 (Regularization network as projection) An interpretation of the regularization network can also be given in terms of a projection. In particular, let $\mathcal{R}$ be the Hilbert space $\mathbb{R}^N \times \mathcal{H}$ (any element is a couple containing a vector $v$ and a function $f$) with norm defined, for any $v \in \mathbb{R}^N$ and $f \in \mathcal{H}$, by
\[
\|(v, f)\|_{\mathcal{R}}^2 = \|v\|^2 + \gamma \|f\|_{\mathcal{H}}^2, \quad \gamma > 0, \quad \| \cdot \| = \text{Euclidean norm}.
\]
Let also $S$ be the (closed) subspace given by all the couples $(v, f)$ satisfying the constraint $v = [f(x_1) \ldots f(x_N)]$. Then, if $g = (Y, 0)$ where $0$ here denotes the null function in $\mathcal{H}$, the projection of $g$ onto $S$ is
\[
g_S = \arg\min_{h \in S} \ \|g - h\|_{\mathcal{R}}^2 = \arg\min_{(v, f) \in S} \ \sum_{i=1}^{N} (y_i - f(x_i))^2 + \gamma \|f\|_{\mathcal{H}}^2.
\]
It is now immediate to conclude that $g_S$ corresponds to $([\hat{g}(x_1) \ldots \hat{g}(x_N)], \hat{g})$ where $\hat{g}$ is indeed the minimizer (6.32), which must thus be unique in view of Theorem 6.25 (Projection theorem). Note that this interpretation can be extended to all the variational problems (6.21) containing losses defined by a norm induced by an inner product in $\mathbb{R}^N$.


6.5.2 Robust Regression via Huber Loss *


As described in Sect. 3.6.1, a shortcoming of the quadratic loss is its sensitivity to outliers because the influence of large residuals $r_i$ grows quadratically. In presence of outliers, it is preferable to use a loss function that grows linearly. These issues have been widely studied in the field of robust statistics [51], where loss functions such as Huber's have been introduced. Recalling (3.115), one has
\[
\mathcal{V}_i(y_i, f(x_i)) = \begin{cases} \dfrac{r_i^2}{2}, & |r_i| \leq \delta \\[2mm] \delta \Big( |r_i| - \dfrac{\delta}{2} \Big), & |r_i| > \delta \end{cases}
\]
where we still have $r_i = y_i - f(x_i)$. The Huber loss function with $\delta = 1$ is shown in Fig. 6.4 (top right panel). Notice that it grows linearly for large residuals and is thus robust to outliers. When $\delta \to +\infty$, one recovers the quadratic loss (up to the factor $1/2$). On the other hand, we also have $\lim_{\delta \to 0^+} \mathcal{V}_i(r_i)/\delta = |r_i|$, that is, the absolute value loss.
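The two regimes of the Huber loss and its limiting behaviours can be illustrated with a short sketch. It assumes the standard form of the loss, quadratic $r^2/2$ for $|r| \leq \delta$ and linear $\delta(|r| - \delta/2)$ otherwise; the function name and the test residuals are chosen here for illustration.

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Huber loss: r^2/2 for |r| <= delta, delta*(|r| - delta/2) otherwise."""
    r = np.asarray(r, dtype=float)
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
values = huber_loss(residuals, delta=1.0)

# Limiting cases: a very large delta gives the (halved) quadratic loss,
# a very small delta gives, after division by delta, the absolute value loss.
large_delta = huber_loss(residuals, delta=1e6)
small_delta = huber_loss(residuals, delta=1e-6) / 1e-6
```

With $\delta = 1$, the residuals $\pm 3$ are penalized by $2.5$ instead of the value $4.5$ given by $r^2/2$, which shows the reduced influence of large residuals.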


6.5.3 Support Vector Regression *


Sometimes, it is desirable to neglect prediction errors, as long as they are below a certain threshold. This can be achieved, e.g., using Vapnik's $\epsilon$-insensitive loss given, for $r_i = y_i - f(x_i)$, by
\[
\mathcal{V}_i(y_i, f(x_i)) = |r_i|_{\epsilon} = \begin{cases} 0, & |r_i| \leq \epsilon \\ |r_i| - \epsilon, & |r_i| > \epsilon \end{cases}
\]
The Vapnik loss with $\epsilon = 0.5$ is shown in Fig. 6.4 (bottom left panel). Notice that it has a null plateau in the interval $[-\epsilon, \epsilon]$ so that any predictor closer than $\epsilon$ to $y_i$ is seen as a perfect interpolant. The loss then grows linearly, thus ensuring robustness. The regularization problem (6.21) associated with the $\epsilon$-insensitive loss function turns out to be
\[
\hat{g} = \arg\min_{f \in \mathcal{H}} \ \sum_{i=1}^{N} |y_i - f(x_i)|_{\epsilon} + \gamma \|f\|_{\mathcal{H}}^2 \tag{6.35}
\]
and is called Support Vector Regression (SVR), see, e.g., [37]. The SVR solution, given by $\hat{g} = \sum_{i=1}^{N} \hat{c}_i K_{x_i}$ according to the representer theorem, is characterized by sparsity in $\hat{c}$, i.e., some components $\hat{c}_i$ are set to zero. This feature is briefly discussed below.

In the SVR case, obtaining the optimal coefficient vector $\hat{c}$ by (6.24) is not trivial since the loss $| \cdot |_{\epsilon}$ is not differentiable everywhere. This difficulty can be circumvented by replacing (6.24) with the following equivalent problem obtained considering two additional $N$-dimensional parameter vectors $\xi$ and $\xi^*$:
\[
\min_{c, \xi, \xi^*} \ \sum_{i=1}^{N} (\xi_i + \xi_i^*) + \gamma c^T \mathbf{K} c, \tag{6.36}
\]
subject to the constraints
\[
\begin{aligned}
y_i - k_i c &\leq \epsilon + \xi_i, & i &= 1, \ldots, N, \\
k_i c - y_i &\leq \epsilon + \xi_i^*, & i &= 1, \ldots, N, \\
\xi_i, \xi_i^* &\geq 0, & i &= 1, \ldots, N.
\end{aligned}
\]
To see that its minimizer contains the optimal solution $\hat{c}$ of (6.24), it suffices to notice that (6.36) assigns a linear penalty only when $|y_i - k_i c| > \epsilon$.
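The equivalence can be made concrete for a fixed vector $c$: the smallest slack variables compatible with the constraints of (6.36) sum exactly to the $\epsilon$-insensitive loss of each residual. The sketch below checks this on a few illustrative residuals; the function names and the numerical values are assumptions made here.

```python
import numpy as np

def eps_insensitive(r, eps):
    """Vapnik's loss |r|_eps = max(|r| - eps, 0)."""
    return np.maximum(np.abs(r) - eps, 0.0)

def optimal_slacks(r, eps):
    """Smallest slacks with r <= eps + xi, -r <= eps + xi_star, xi, xi_star >= 0."""
    xi = np.maximum(r - eps, 0.0)
    xi_star = np.maximum(-r - eps, 0.0)
    return xi, xi_star

eps = 0.5
r = np.array([-2.0, -0.5, -0.1, 0.0, 0.3, 0.5, 1.2])   # residuals y_i - k_i c
xi, xi_star = optimal_slacks(r, eps)
slack_cost = xi + xi_star                               # equals |r_i|_eps
```

For each residual at most one of the two slacks is nonzero, and both vanish inside the plateau $[-\epsilon, \epsilon]$.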

Problem (6.36) is quadratic subject to linear inequality constraints, hence it is solvable by standard optimization approaches like interior point methods [64, 108]. Calculating the Karush-Kuhn-Tucker conditions, it is possible to show that the condition $|y_i - k_i \hat{c}| < \epsilon$ implies $\hat{c}_i = 0$. Indexes $i$ for which $\hat{c}_i \neq 0$ instead identify the set of input locations $x_i$ called support vectors.
6.5.4 Support Vector Classification *


The three losses illustrated above were originally proposed for regression problems, with the output $y$ real valued. When the outputs can assume only two values, e.g., $1$ and $-1$, a classification problem arises. Here, the purpose of the predictor is just to separate two classes. This problem can be seen as a special case of regression. In particular, even if the output space is binary, consider prediction functions $f : \mathcal{X} \to \mathbb{R}$ and assume that the input $x_i$ is associated to the class $1$ if $f(x_i) > 0$ and to the class $-1$ if $f(x_i) < 0$. Let the margin on an example $(x_i, y_i)$ be $m_i = y_i f(x_i)$. Then, we will see that the value of $m_i$ is a measure of how well we are classifying the available data. One can thus try to maximize the margin while still searching for a function that is not too complex according to the RKHS norm. In particular, we can exploit (6.21) with a loss that depends on the margin as described below.
The most natural classification loss is the $0-1$ loss defined for any $i$ by
\[
\mathcal{V}_i(y_i, f(x_i)) = \begin{cases} 0, & m > 0 \\ 1, & m \leq 0 \end{cases}, \quad m = y_i f(x_i),
\]
and depicted in Fig. 6.4 (bottom right panel, dashed line). Adopting it, the first component of the objective in (6.21) returns the number of misclassifications. However, the $0-1$ loss is not convex and leads to an optimization problem of combinatorial nature.

An alternative is the so-called hinge loss [98] defined by
\[
\mathcal{V}_i(y_i, f(x_i)) = |1 - y_i f(x_i)|_{+} = \begin{cases} 0, & m > 1 \\ 1 - m, & m \leq 1 \end{cases}, \quad m = y_i f(x_i),
\]
which thus provides a linear penalty when $m < 1$. Figure 6.4 (bottom right panel, solid line) illustrates that it is a convex upper bound on the $0-1$ loss. The problem associated with the hinge loss turns out to be
\[
\hat{g} = \arg\min_{f \in \mathcal{H}} \ \sum_{i=1}^{N} |1 - y_i f(x_i)|_{+} + \gamma \|f\|_{\mathcal{H}}^2, \tag{6.37}
\]
and is called support vector classification (SVC).
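The relation between the two classification losses can be checked on a handful of margins. In this sketch the labels and the predictor values are invented for illustration, and the $0-1$ loss is assumed to count a zero margin as an error.

```python
import numpy as np

def zero_one_loss(margin):
    """0-1 loss: 1 when the margin is not positive, 0 otherwise."""
    return (np.asarray(margin) <= 0).astype(float)

def hinge_loss(margin):
    """Hinge loss |1 - m|_+ = max(1 - m, 0)."""
    return np.maximum(1.0 - np.asarray(margin, dtype=float), 0.0)

y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])     # labels (illustrative)
f_x = np.array([2.0, 0.3, -0.2, 0.7, -1.5])   # predictor values f(x_i)
margins = y * f_x
hinge = hinge_loss(margins)
misclassified = zero_one_loss(margins)
```

The second and third examples are correctly classified but have margin below one, so they still receive a positive hinge penalty; on every example the hinge loss is at least as large as the $0-1$ loss.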
As in the SVR case, obtaining the optimal coefficient vector by (6.37) is not trivial since the hinge loss is not differentiable. But one can still resort to an equivalent problem, now obtained considering just an additional parameter vector $\xi$:
\[
\min_{c, \xi} \ \sum_{i=1}^{N} \xi_i + \gamma c^T \mathbf{K} c, \tag{6.38}
\]
subject to the constraints
\[
\begin{aligned}
y_i (k_i c) &\geq 1 - \xi_i, & i &= 1, \ldots, N, \\
\xi_i &\geq 0, & i &= 1, \ldots, N.
\end{aligned}
\]
As in the SVR case, the optimal solution $\hat{c}$ is sparse and indexes $i$ for which $\hat{c}_i \neq 0$ define the support vectors $x_i$.


6.6 Kernels Examples 


The reproducing kernel characterizes the hypothesis space $\mathcal{H}$. Together with the loss function, it also completely defines the key estimator (6.21) which exploits the RKHS norm as regularizer. The choice of $K$ has thus a crucial impact on the ability of predicting future output data. Some important RKHSs are discussed below.


6.6.1 Linear Kernels, Regularized Linear Regression 
and System Identification 


We now show that the regularization network (6.32) generalizes the ReLS-Q problem 
introduced in Chap. 3 which adopts quadratic penalties. The link is provided by the 
concept of linear kernel. 

We start assuming that the input space is $\mathcal{X} = \mathbb{R}^m$. Hence, any input location $x$ corresponds to an $m$-dimensional (column) vector. If $P \in \mathbb{R}^{m \times m}$ denotes a symmetric and positive semidefinite matrix, a linear kernel is defined as follows
\[
K(y, x) = y^T P x, \quad (x, y) \in \mathbb{R}^m \times \mathbb{R}^m.
\]
All the kernel sections are linear functions. Hence, their span defines a finite-dimensional (closed) subspace of linear functions that, in view of Theorem 6.1 (and subsequent discussion), coincides with the whole $\mathcal{H}$. Hence, the RKHS induced by the linear kernel is simply a space of linear functions and, for any $g \in \mathcal{H}$, there exists $a \in \mathbb{R}^m$ such that
\[
g(x) = a^T P x = K_a(x).
\]
If $P$ is full rank, letting $\theta := P a$, we also have
\[
\|g\|_{\mathcal{H}}^2 = \|K_a\|_{\mathcal{H}}^2 = \langle K_a, K_a \rangle_{\mathcal{H}} = K(a, a) = a^T P a = \theta^T P^{-1} \theta.
\]

Now, let us use such $\mathcal{H}$ in the regularization network (6.32). Without using the representer theorem, we can plug the representation $g(x) = \theta^T x$ in the regularization problem to obtain $\hat{g}(x) = \hat{\theta}^T x$ where
\[
\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^m} \ \|Y - \Phi \theta\|^2 + \gamma \theta^T P^{-1} \theta, \tag{6.39}
\]
with the $i$th row of the regression matrix $\Phi$ equal to $x_i^T$. One can see that (6.39) coincides with ReLS-Q, with the regularization matrix $P$ which defines the linear kernel $K$ and, in turn, the penalty term $\theta^T P^{-1} \theta$.
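The coincidence between the regularization network with a linear kernel and ReLS-Q can be verified numerically. The sketch below uses random data and a randomly generated positive definite $P$, both invented for illustration, and compares the estimate obtained from the kernel matrix through (6.34) with the one obtained from the normal equations of (6.39).

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, gamma = 25, 4, 0.5
Phi = rng.standard_normal((N, m))            # ith row is x_i^T
Y = rng.standard_normal(N)
A = rng.standard_normal((m, m))
P = A @ A.T + 0.1 * np.eye(m)                # symmetric positive definite

# Route 1: regularization network with kernel matrix K_ij = x_i^T P x_j.
K = Phi @ P @ Phi.T
c_hat = np.linalg.solve(K + gamma * np.eye(N), Y)
# g_hat(x) = sum_i c_hat_i x_i^T P x = theta^T x with theta = P Phi^T c_hat.
theta_kernel = P @ Phi.T @ c_hat

# Route 2: ReLS-Q (6.39), solved through its normal equations.
theta_rels = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)
```

The two vectors agree up to rounding errors. The first route solves an $N \times N$ system and the second an $m \times m$ one, so in practice the cheaper of the two can be selected according to the relative size of $N$ and $m$.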

We now derive the connection with linear system identification in discrete time. The data set consists of the output measurements $\{y_i\}_{i=1}^{N}$, collected on the time instants $\{t_i\}_{i=1}^{N}$, and of the system input $u$. We can form each input location using past input values as follows
\[
x_i = [u_{t_i - 1} \ u_{t_i - 2} \ \ldots \ u_{t_i - m}]^T, \tag{6.40}
\]
where $m$ is the FIR order and an input delay of one unit has been assumed. Then, if $Y$ collects the noisy outputs, $\hat{\theta}$ becomes the impulse response estimate. This establishes a correspondence between regularized FIR estimation and regularization in RKHS induced by linear kernels.
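The construction of the input locations (6.40) can be sketched as follows, assuming integer time instants $t_i = 0, 1, \ldots, N-1$ and zero input before time zero. These assumptions, the helper name `fir_regressors` and the test impulse response are introduced here only for illustration.

```python
import numpy as np

def fir_regressors(u, m):
    """Regression matrix whose row for time t is [u[t-1], u[t-2], ..., u[t-m]].

    Samples before time 0 are taken as zero (an assumption on the initial conditions).
    """
    N = len(u)
    padded = np.concatenate([np.zeros(m), u])
    # padded[t:t+m] holds u[t-m], ..., u[t-1]; reversing gives u[t-1], ..., u[t-m].
    return np.array([padded[t:t + m][::-1] for t in range(N)])

rng = np.random.default_rng(3)
N, m = 200, 10
u = rng.standard_normal(N)
theta_true = 0.8 ** np.arange(1, m + 1)      # impulse response used for the test
Phi = fir_regressors(u, m)
Y = Phi @ theta_true                          # noise-free outputs

# Same outputs computed as a convolution with one unit of input delay.
Y_conv = np.convolve(np.concatenate([[0.0], theta_true]), u)[:N]
```

The product `Phi @ theta_true` coincides with the delayed convolution `Y_conv`, which confirms that each row of `Phi` plays the role of $x_i^T$ in (6.39).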


6.6.1.1 Infinite-Dimensional Extensions * 


In place of $\mathcal{X} = \mathbb{R}^m$, let now $\mathcal{X} \subset \mathbb{R}^{\infty}$, i.e., the input space contains sequences. We can interpret any input location as an infinite-dimensional column vector and use ordinary notation of algebra to handle infinite-dimensional objects. For instance, if $x, y \in \ell_2$ then $x^T y = \langle x, y \rangle_2$ where $\langle \cdot, \cdot \rangle_2$ is the inner product in $\ell_2$. Assume we are given a symmetric and infinite-dimensional matrix $P$ such that the linear kernel
\[
K(y, x) = y^T P x
\]
is well defined over a subset of $\mathbb{R}^{\infty} \times \mathbb{R}^{\infty}$. For example, if $P$ is absolutely summable, i.e., $\sum_{ij} |P_{ij}| < \infty$, the kernel is well defined for any input location $x \in \mathcal{X}$ with $\mathcal{X} = \ell_{\infty}$. The kernel section centred on $x$ is the infinite-dimensional column vector $Px$. Following arguments similar to those seen in the finite-dimensional case, one can conclude that the RKHS associated to such $K$ contains linear functions of the form $g(x) = a^T P x$ with $a \in \mathcal{X}$. Roughly speaking, the regularization network (6.32) relying on such hypothesis space is the limit of Problem (6.39) for $m \to \infty$. To compute the solution, in this case it is necessary to resort to the representer theorem (6.22). One obtains
\[
\hat{g}(x) = \sum_{i=1}^{N} \hat{c}_i K_{x_i}(x) = \hat{\theta}^T x,
\]
where $\hat{c}$ is defined by (6.34) and
\[
\hat{\theta} = \sum_{i=1}^{N} \hat{c}_i P x_i.
\]
The link with linear system identification follows the same reasoning previously developed but $x_i$ now contains an infinite number of past input values, i.e.,
\[
x_i = [u_{t_i - 1} \ u_{t_i - 2} \ u_{t_i - 3} \ \ldots]^T.
\]
With this correspondence, the regularization network now implements regularized IIR estimation and $\hat{\theta}$ contains the estimates of the impulse response coefficients. In fact, note that the nature of $x_i$ makes the value $\hat{g}(x_i)$ the convolution between the system input $u$ and $\hat{\theta}$ evaluated at $t_i$ (with one unit input delay).

In a more sophisticated scenario, in place of sequences, the input space $\mathcal{X}$ could contain functions. For instance, $\mathcal{X} \subset \mathcal{P}^c$ where $\mathcal{P}^c$ is the space of piecewise continuous functions on $\mathbb{R}^+$. Thus, each input location corresponds to a piecewise continuous function $x : \mathbb{R}^+ \to \mathbb{R}$. Given a suitable symmetric function $P : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$, a linear kernel is now defined by
\[
K(y, x) = \int_{\mathbb{R}^+ \times \mathbb{R}^+} y(t) P(t, \tau) x(\tau) \, dt \, d\tau.
\]
The corresponding RKHS thus contains linear functionals: any $f \in \mathcal{H}$ maps $x$ (which is a function) into $\mathbb{R}$. The solution of the regularization network (6.32) equipped with such hypothesis space is
\[
\hat{g}(x) = \sum_{i=1}^{N} \hat{c}_i K_{x_i}(x) = \int_{\mathbb{R}^+} \hat{\theta}(\tau) x(\tau) \, d\tau,
\]
where $\hat{c}$ is still defined by (6.34) and
\[
\hat{\theta}(\tau) := \sum_{i=1}^{N} \hat{c}_i \int_{\mathbb{R}^+} P(\tau, t) x_i(t) \, dt.
\]
The connection with linear system identification is obtained by defining
\[
x_i(t) = u(t_i - t), \quad t \geq 0
\]
(if the input $u(t)$ is continuous for $t \geq 0$ and causal, the function $x_i(t)$ is piecewise continuous, which makes the assumption $\mathcal{X} \subset \mathcal{P}^c$ necessary). In this way, each $g \in \mathcal{H}$ represents a different linear system. Furthermore, the regularization network (6.32) implements regularized system identification in continuous time and $\hat{\theta}$ is the continuous-time impulse response estimate. The class of kernels which include the BIBO stability constraint will be discussed in the next chapter.
6.6.2 Kernels Given by a Finite Number of Basis Functions


Assume we are given an input space 2 and m independent functions p; : 2 —> R. 
Then, we define 


K(x, y) =} pilx)pi(y). 


i=l 


It is easy to verify that K is a positive definite kernel. Recalling Theorem 6.13,
the associated RKHS coincides with the m-dimensional space spanned by the basis
functions $\rho_i$. Each function in $\mathcal{H}$ has the representation
$g(x) = \sum_{i=1}^{m} \theta_i \rho_i(x)$ and, in view of (6.20) and the independence
of the basis functions, one has

$$\|g\|_{\mathcal{H}}^2 = \sum_{i=1}^{m} \theta_i^2.$$

Consider now the regularization network (6.32) equipped with such hypothesis space.
The solution can be computed without using the representer theorem by plugging in
(6.32) the expression of g as a function of $\theta$. Letting $\Phi \in \mathbb{R}^{N \times m}$
with $\Phi_{ij} = \rho_j(x_i)$, we obtain $\hat{g} = \sum_{i=1}^{m} \hat{\theta}_i \rho_i$ with

$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^m} \|Y - \Phi\theta\|^2 + \gamma \|\theta\|^2. \qquad (6.41)$$

The solution (6.41) coincides with the ridge regression estimate introduced in
Sect. 1.2.
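
The equivalence between the ridge estimate (6.41) and the kernel-based solution can be checked numerically. The following is a minimal sketch, not the book's implementation: the monomial basis $\rho_j(x) = x^{j-1}$, the data and the value of $\gamma$ are hypothetical choices, and the representer-theorem coefficients are assumed to take the usual form $\hat{c} = (K + \gamma I_N)^{-1} Y$ with $K = \Phi\Phi^T$.

```python
import numpy as np

def basis_matrix(x, m):
    """Phi[i, j] = rho_j(x_i), with the hypothetical basis rho_j(x) = x**j, j = 0..m-1."""
    return np.vander(x, N=m, increasing=True)

def ridge_estimate(Phi, Y, gamma):
    """theta_hat = argmin ||Y - Phi theta||^2 + gamma ||theta||^2, cf. (6.41)."""
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + gamma * np.eye(m), Phi.T @ Y)

def kernel_coefficients(Phi, Y, gamma):
    """Assumed representer-theorem route: c_hat = (K + gamma I)^{-1} Y, K = Phi Phi^T."""
    K = Phi @ Phi.T
    return np.linalg.solve(K + gamma * np.eye(K.shape[0]), Y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
Y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

Phi = basis_matrix(x, m=4)
gamma = 0.5
theta_hat = ridge_estimate(Phi, Y, gamma)
c_hat = kernel_coefficients(Phi, Y, gamma)

# g_hat = sum_i c_i K_{x_i} = sum_j (Phi^T c)_j rho_j, so Phi^T c_hat
# should reproduce the ridge coefficients.
theta_from_kernel = Phi.T @ c_hat
```

The two routes solve linear systems of size m and N, respectively, so the parametric form is preferable when m is much smaller than N.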


6.6.3 Feature Map and Feature Space ⋆


Let $\mathcal{F}$ be a space endowed with an inner product, and assume that a representation
of the form

$$K(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{F}}, \quad \phi : \mathcal{X} \to \mathcal{F}, \qquad (6.42)$$

is available. Then, it follows immediately that K is a positive definite kernel. In this
context, $\phi$ is called a feature map, and $\mathcal{F}$ the feature space. For instance, to
have the connection with the kernel discussed in the previous subsection, we can think of
$\phi$ as a vector containing m functions. It is defined for any x by

$$\phi(x) = \begin{bmatrix} \rho_1(x) \\ \rho_2(x) \\ \vdots \\ \rho_m(x) \end{bmatrix}$$

so that $\mathcal{F} = \mathbb{R}^m$ with the Euclidean inner product. Then, we obtain

$$K(x, y) = \langle \phi(x), \phi(y) \rangle_2 = \phi(x)^T \phi(y) = \sum_{i=1}^{m} \rho_i(x)\rho_i(y).$$


Now, given any positive definite kernel K, Theorem 6.2 (Moore-Aronszajn theorem)
implies the existence of at least one feature map, namely, the RKHS map
$\phi_{\mathcal{H}} : \mathcal{X} \to \mathcal{H}$ such that

$$\phi_{\mathcal{H}}(x) = K_x,$$

where the representation (6.42) follows immediately from the reproducing property.
These arguments show that K is a positive definite kernel iff there exists at least one
Hilbert space $\mathcal{F}$ and a map $\phi : \mathcal{X} \to \mathcal{F}$ such that
$K(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{F}}$.
Feature maps and feature spaces are not unique since, by introducing any linear
isometry $I : \mathcal{H} \to \mathcal{F}$, one can obtain a representation in a different space:

$$K(x, y) = \langle \phi_{\mathcal{H}}(x), \phi_{\mathcal{H}}(y) \rangle_{\mathcal{H}} = \langle I \circ \phi_{\mathcal{H}}(x), I \circ \phi_{\mathcal{H}}(y) \rangle_{\mathcal{F}}.$$


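Now, assume that the kernel admits the decomposition (6.8), i.e.,

$$K(x, y) = \sum_{i=1}^{\infty} \zeta_i \rho_i(x)\rho_i(y)$$

with $\zeta_i > 0 \ \forall i$. Then, a spectral feature map of K is

$$\phi_\mu : \mathcal{X} \to \ell_2$$

with

$$\phi_\mu(x) = \{\sqrt{\zeta_i}\, \rho_i(x)\}_{i=1}^{\infty}, \quad x \in \mathcal{X}.$$

In fact, we have

$$\langle \phi_\mu(x), \phi_\mu(y) \rangle_2 = \sum_{i=1}^{\infty} \zeta_i \rho_i(x)\rho_i(y) = K(x, y).$$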


It is also worth pointing out the role of the feature map within the estimation
scenario. In many applications, linear functions are not powerful enough models.
Kernels define more expressive spaces by (implicitly) mapping the data into a
high-dimensional feature space where linear machines can be applied. Moreover, the use of
the estimator (6.21) does not require knowledge of any feature map associated with K: the
representer theorem shows that the only information needed to compute the estimate
is the kernel matrix, as also discussed in Remark 6.3.
6.6.4 Polynomial Kernels


Another example of kernel is the (inhomogeneous) polynomial kernel [70]. For
$x, y \in \mathbb{R}^m$, it is defined by

$$K(x, y) = (\langle x, y \rangle_2 + c)^p, \quad p \in \mathbb{N}, \quad c > 0,$$

with $\langle \cdot, \cdot \rangle_2$ denoting the classical Euclidean inner product. As an
example, assume c = 1 and m = p = 2 with $x = [x_a \ x_b]$ and $y = [y_a \ y_b]$. Then, one
obtains the kernel expansion

$$K(x, y) = 1 + x_a^2 y_a^2 + x_b^2 y_b^2 + 2 x_a x_b y_a y_b + 2 x_a y_a + 2 x_b y_b,$$

of the type (6.8) with the $\rho_i(x_a, x_b)$ given by all the monomials of degree up to 2,
i.e., the 6 functions

$$1, \quad x_a^2, \quad x_b^2, \quad x_a x_b, \quad x_a, \quad x_b.$$

More generally, if c > 0, the polynomial kernel induces a $\binom{m+p}{p}$-dimensional RKHS
spanned by all possible monomials up to the pth degree. The number of basis functions
is thus finite but exponential in p. This simple example is in some sense opposite to
that described in Sect. 6.6.2. It shows how a kernel can be used to define implicitly
a rich class of basis functions.
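
The expansion for c = 1 and m = p = 2 can be verified directly. The sketch below is only illustrative: the explicit feature map absorbs the expansion weights (the factors 2) into the monomials through $\sqrt{2}$ scalings, and the two test vectors are arbitrary.

```python
import numpy as np

def poly_kernel(x, y, c=1.0, p=2):
    """Inhomogeneous polynomial kernel (<x, y>_2 + c)^p."""
    return (np.dot(x, y) + c) ** p

def feature_map(x):
    """Explicit feature map for m = p = 2 and c = 1: scaled monomials up to degree 2."""
    xa, xb = x
    s = np.sqrt(2.0)
    return np.array([1.0, xa**2, xb**2, s * xa * xb, s * xa, s * xb])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])

k_implicit = poly_kernel(x, y)                 # one inner product in R^2
k_explicit = feature_map(x) @ feature_map(y)   # inner product in the 6-dim feature space
```

The implicit evaluation costs one inner product in $\mathbb{R}^m$, whereas the explicit route requires building all the monomials first.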


6.6.5 Translation Invariant and Radial Basis Kernels 


A kernel is said to be translation invariant if there exists $h : \mathcal{X} \to \mathbb{R}$
such that $K(x, y) = h(x - y)$. This class has already been encountered in Example 6.12 where
its relationship with the Fourier basis (in the case of one-dimensional input space) is
illustrated. A general characterization is given below, see also [80].


Theorem 6.18 (Bochner, based on [23]) A positive definite kernel K over
$\mathcal{X} = \mathbb{R}^d$ is continuous and of the form $K(x, y) = h(x - y)$ if and only if
there exists a probability measure $\mu$ and a positive scalar $\eta$ such that:

$$K(x, y) = \eta \int_{\mathcal{X}} \cos(\langle z, x - y \rangle_2)\, d\mu(z).$$


Translation invariant kernels also include the class of radial basis kernels (RBF)
of the form $K(x, y) = h(\|x - y\|)$ where $\| \cdot \|$ is the Euclidean norm [85]. A notable
example is the so-called Gaussian kernel:

$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{\rho}\right), \quad \rho > 0, \qquad (6.43)$$

where $\rho$ denotes the kernel width. This kernel is often used to model functions
expected to be somewhat regular. Note however that $\rho$ has an important role in
tuning the smoothness level. A low value makes the kernel close to diagonal so that
a low norm can be assigned also to rapidly changing functions. On the other hand, as
$\rho$ grows to infinity, the kernel tends to a constant and only functions close to being
constant are given a low penalty. This is the same phenomenon illustrated in Fig. 6.1.

Another widely adopted kernel, which induces spaces of functions less regular
than the Gaussian one, is the Laplacian kernel which uses the Euclidean norm in
place of the squared Euclidean norm:

$$K(x, y) = \exp\left(-\frac{\|x - y\|}{\rho}\right), \quad \rho > 0. \qquad (6.44)$$


Differently from the kernels described in the first part of Sect. 6.6.1, as well as in
Sects. 6.6.2 and 6.6.4, the RKHS associated with any non-constant RBF kernel is
infinite dimensional (it cannot be spanned by a finite number of basis functions). The
associated RKHS can be shown to be dense in the space of all continuous functions
defined on a compact subset $\mathcal{X} \subset \mathbb{R}^m$. This means that every
continuous function can be represented in this space with the desired accuracy as measured
by the sup-norm $\sup_{x \in \mathcal{X}} |f(x)|$. This property is called universality. This
does not imply that the RKHS induced by a universal kernel includes every continuous function.
For instance, the Gaussian kernel is universal but it has been proved that its RKHS does not
contain any polynomial, including the constant function [69].
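
The role of the kernel width can be illustrated on a small grid. The following sketch uses hypothetical input locations and widths: a very small $\rho$ gives a Gaussian kernel matrix close to the identity, a very large one gives a matrix close to all ones, and the Laplacian kernel matrix on distinct points has strictly positive eigenvalues.

```python
import numpy as np

def gaussian_kernel(X, Z, rho):
    """Gaussian kernel (6.43): exp(-||x - z||^2 / rho), rows of X and Z are inputs."""
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / rho)

def laplacian_kernel(X, Z, rho):
    """Laplacian kernel (6.44): exp(-||x - z|| / rho)."""
    d = np.sqrt(np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1))
    return np.exp(-d / rho)

# Six equally spaced scalar inputs in [0, 1] (arbitrary illustrative choice)
X = np.linspace(0.0, 1.0, 6).reshape(-1, 1)

K_narrow = gaussian_kernel(X, X, rho=1e-4)   # nearly diagonal
K_wide = gaussian_kernel(X, X, rho=1e4)      # nearly constant
K_lap = laplacian_kernel(X, X, rho=0.5)
min_eig_lap = np.linalg.eigvalsh(K_lap).min()
```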


6.6.6 Spline Kernels 


To simplify the exposition, let $\mathcal{X} = [0, 1]$ and let also $g^{(j)}$ denote the jth
derivative of g, with $g^{(0)} := g$. Intuitively, in many circumstances an effective
regularizer is obtained by penalizing the energy of the pth derivative of g, i.e., employing

$$\int_0^1 \left(g^{(p)}(x)\right)^2 dx.$$


An interesting question is whether this penalty term can be cast in the RKHS theory.
For p = 1, a positive answer has been given by Example 6.5. Actually, the answer
is positive for any integer p. In fact, consider the Sobolev space of functions g
whose first p − 1 derivatives are absolutely continuous and satisfy $g^{(j)}(0) = 0$ for
j = 0, ..., p − 1. The same arguments developed in Example 6.5 when p = 1 can
be easily generalized to prove that this is a RKHS $\mathcal{H}$ with norm

$$\|g\|_{\mathcal{H}}^2 = \int_0^1 \left(g^{(p)}(x)\right)^2 dx.$$
The corresponding kernel is the pth-order spline kernel

$$K(x, y) = \int_0^1 G_p(x, u)\, G_p(y, u)\, du, \qquad (6.45)$$

where $G_p$ is the so-called Green's function given by

$$G_p(x, u) = \frac{(x - u)_+^{p-1}}{(p-1)!}, \qquad (u)_+ = \begin{cases} u & \text{if } u \ge 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (6.46)$$

Note that the Laplace transform of $G_p(\cdot, 0)$ is $1/s^p$. Hence, the Green's function is
connected with the impulse response of a p-fold integrator. When p = 1, we recover
the linear spline kernel of Example 6.5:

$$K(x, y) = \min\{x, y\} \qquad (6.47)$$


whereas p = 2 leads to the popular cubic spline kernel [104]:

$$K(x, y) = \frac{x y \min\{x, y\}}{2} - \frac{(\min\{x, y\})^3}{6}. \qquad (6.48)$$

The linear and the cubic spline kernel are displayed in Fig. 6.2.
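
The closed forms (6.47) and (6.48) can be compared with a direct numerical evaluation of the integral (6.45). This is a minimal sketch under stated assumptions: the integral is approximated by the trapezoidal rule on a uniform grid (a hypothetical choice of quadrature), and for p = 1 the term $(x - u)_+^0$ is read as the unit step.

```python
import numpy as np
from math import factorial

def green(x, u, p):
    """Green's function G_p(x, u) = (x - u)_+^{p-1} / (p - 1)!, cf. (6.46)."""
    if p == 1:
        return (x - u > 0).astype(float)   # (.)_+^0 interpreted as the unit step
    return np.maximum(x - u, 0.0) ** (p - 1) / factorial(p - 1)

def spline_kernel_numeric(x, y, p, n_grid=200001):
    """Trapezoidal approximation of the integral (6.45) over u in [0, 1]."""
    u = np.linspace(0.0, 1.0, n_grid)
    vals = green(x, u, p) * green(y, u, p)
    du = u[1] - u[0]
    return du * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

def linear_spline_kernel(x, y):
    """Closed form (6.47)."""
    return min(x, y)

def cubic_spline_kernel(x, y):
    """Closed form (6.48)."""
    m = min(x, y)
    return x * y * m / 2.0 - m**3 / 6.0

x, y = 0.3, 0.7
k1_num = spline_kernel_numeric(x, y, p=1)
k2_num = spline_kernel_numeric(x, y, p=2)
```

For p = 1 the integrand is discontinuous, so the quadrature error is of the order of the grid step; for p = 2 the integrand is continuous and the agreement is much tighter.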

We can use the spline hypothesis space in the regularization problem (6.21). Then,
from the representer theorem one obtains that the estimate $\hat{g}$ is a pth-order smoothing
spline with derivatives continuous exactly up to order 2p − 2 (the choice of the order is
thus related to the expected function smoothness). This can be seen also from the
kernel sections plotted in Fig. 6.2 for p equal to 1 and 2. For p = 2 the (finite) sum
of kernel sections provides the well-known cubic smoothing splines, i.e., piecewise
third-order polynomials.

Spline functions enjoy many numerical properties originally studied in the
interpolation scenario. In particular, piecewise polynomials circumvent Runge's
phenomenon (large oscillations affecting the reconstructed function) which, e.g., arises
when high-order polynomials are employed [81]. Fit convergence rates are discussed,
e.g., in [3, 14].


6.6.7 The Bias Space and the Spline Estimator 


Bias space As discussed in Sect. 4.5, in a Bayesian setting, in some cases it can be
useful to enrich $\mathcal{H}$ with a low-dimensional parametric part, known in the literature
as bias space. The bias space typically consists of linear combinations of functions
$\{\phi_k\}_{k=1}^{m}$. For instance, if the unknown function exhibits a linear trend, one may
let m = 2 and $\phi_1(x) = 1$, $\phi_2(x) = x$. Then, one can assume that g is the sum of two
functions, one in $\mathcal{H}$ and the other one in the bias space. In this way, the function
space becomes $\mathcal{H} + \mathrm{span}\{\phi_1, \ldots, \phi_m\}$. Using a quadratic loss,
the regularization problem is given by

$$(\hat{f}, \hat{\theta}) = \arg\min_{f \in \mathcal{H},\ \theta \in \mathbb{R}^m} \sum_{i=1}^{N} \Big( y_i - f(x_i) - \sum_{k=1}^{m} \theta_k \phi_k(x_i) \Big)^2 + \gamma \|f\|_{\mathcal{H}}^2 \qquad (6.49)$$


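and the overall function estimate turns out to be $\hat{g} = \hat{f} + \sum_{k=1}^{m} \hat{\theta}_k \phi_k$.
Note that the expansion coefficients in $\theta$ are not subject to any penalty term but a low
value for m avoids overfitting. The solution can be computed exploiting an extended version
of the representer theorem. In particular, it holds that

$$\hat{g} = \sum_{i=1}^{N} \hat{c}_i K_{x_i} + \sum_{k=1}^{m} \hat{\theta}_k \phi_k, \qquad (6.50)$$

where, assuming that $\Phi \in \mathbb{R}^{N \times m}$ is full column rank and $\Phi_{ij} = \phi_j(x_i)$,

$$\hat{\theta} = \left(\Phi^T A^{-1} \Phi\right)^{-1} \Phi^T A^{-1} Y \qquad (6.51a)$$
$$\hat{c} = A^{-1}\left(Y - \Phi\hat{\theta}\right) \qquad (6.51b)$$
$$A = K + \gamma I_N. \qquad (6.51c)$$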

Remark 6.5 (Extended version of the representer theorem) The correctness of formulas
(6.51a)-(6.51c) can be easily verified as follows. Fix $\theta$ to the optimizer $\hat{\theta}$ in
the objective present in the rhs of (6.49). Then, we can use the representer theorem with
Y replaced by $Y - \Phi\hat{\theta}$ to obtain $\hat{f} = \sum_{i=1}^{N} \hat{c}_i K_{x_i}$ with

$$\hat{c} = A^{-1}\left(Y - \Phi\hat{\theta}\right)$$

with A indeed given by (6.51c). This proves (6.51b). Using the definition of A this
also implies

$$Y - K\hat{c} = \Phi\hat{\theta} + \gamma\hat{c}.$$

Now, if we fix f to $\hat{f}$, the optimizer $\hat{\theta}$ is just the least squares estimate of
$\theta$ with Y replaced by $Y - K\hat{c}$. Hence, we obtain

$$\hat{\theta} = \left(\Phi^T\Phi\right)^{-1}\Phi^T\left(Y - K\hat{c}\right).$$

Using $Y - K\hat{c} = \Phi\hat{\theta} + \gamma\hat{c}$ in the expression for $\hat{\theta}$, we
obtain $\left(\Phi^T\Phi\right)^{-1}\Phi^T\hat{c} = 0$. Multiplying the lhs and rhs of (6.51b) by
$\left(\Phi^T\Phi\right)^{-1}\Phi^T$ and using this last equality, (6.51a) is finally obtained.
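
Formulas (6.51a)-(6.51c) and the two identities used in the remark can be checked numerically. The sketch below is only an illustration: the data, the cubic spline kernel, the linear bias space ($\phi_1(x) = 1$, $\phi_2(x) = x$) and the value of $\gamma$ are hypothetical choices.

```python
import numpy as np

def cubic_spline_kernel(x, z):
    """Cubic spline kernel (6.48) evaluated on all pairs of two input vectors."""
    m = np.minimum(x[:, None], z[None, :])
    return x[:, None] * z[None, :] * m / 2.0 - m**3 / 6.0

def fit_with_bias(K, Phi, Y, gamma):
    """Return (theta_hat, c_hat) according to (6.51a)-(6.51c)."""
    A = K + gamma * np.eye(K.shape[0])                          # (6.51c)
    Ainv_Phi = np.linalg.solve(A, Phi)
    Ainv_Y = np.linalg.solve(A, Y)
    theta = np.linalg.solve(Phi.T @ Ainv_Phi, Phi.T @ Ainv_Y)   # (6.51a)
    c = Ainv_Y - Ainv_Phi @ theta                               # (6.51b)
    return theta, c

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 40))
Y = 1.0 + 2.0 * x + np.sin(6 * x) + 0.1 * rng.standard_normal(x.size)

K = cubic_spline_kernel(x, x)
Phi = np.column_stack([np.ones_like(x), x])   # phi_1(x) = 1, phi_2(x) = x
gamma = 1e-2
theta_hat, c_hat = fit_with_bias(K, Phi, Y, gamma)

# Fitted values of g_hat at the training inputs, cf. (6.50)
g_hat = K @ c_hat + Phi @ theta_hat
```

Solving with A through `np.linalg.solve` rather than forming $A^{-1}$ explicitly is a numerical convenience and does not change the estimate.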
The spline estimator The bias space is useful, e.g., when spline kernels are adopted.
In fact, the spline space of order p contains functions all satisfying the constraints
$g^{(j)}(0) = 0$ for j = 0, ..., p − 1. Then, to cope with nonzero initial conditions, one
can enrich such RKHS with polynomials up to order p − 1. The enriched space is
$\mathcal{H} \oplus \mathrm{span}\{1, x, \ldots, x^{p-1}\}$, with $\oplus$ denoting a direct sum,
and enjoys the universality property mentioned at the end of Sect. 6.6.5. The resulting spline
estimator becomes a notable example of (6.49): it solves

$$\min_{f \in \mathcal{H},\ \theta \in \mathbb{R}^p} \sum_{i=1}^{N} \Big( y_i - f(x_i) - \sum_{k=1}^{p} \theta_k x_i^{k-1} \Big)^2 + \gamma \int_0^1 \left(f^{(p)}(x)\right)^2 dx, \qquad (6.52)$$

whose explicit solution is given by (6.50) setting $\phi_k(x) = x^{k-1}$ and $\Phi_{ij} = x_i^{j-1}$.

We consider a simple numerical example to illustrate the estimator (6.52) and the
impact of different choices of $\gamma$ on its performance. The task is the reconstruction of

the function g(x) = e%"7), with x € [0, 1], from 100 direct samples corrupted by 


white Gaussian noise with standard deviation 0.3. The estimates coming from
(6.52) with p = 2 and three different values of $\gamma$ are displayed in the three panels
of Fig. 6.5. The cubic spline estimate plotted in the top left panel is affected by
oversmoothing: the too large value of $\gamma$ overweights the norm of f in the objective
(6.52), introducing a large bias. Hence, the model is too rigid, unable to describe the
data. The top right panel displays the opposite situation obtained adopting a too low
value for $\gamma$ which overweights the loss function in (6.52). This leads to a high variance
estimator: the model is overly flexible and overfits the measurements. Finally, the
estimate in the bottom panel of Fig. 6.5 is obtained using the regularization parameter
optimal in the MSE sense. The good trade-off between bias and variance leads to an
estimate close to truth. As already pointed out in the previous chapters, the choice of
$\gamma$ can thus be interpreted as the counterpart of model order selection in the classical
parametric paradigm.

Fig. 6.5 Cubic spline estimator (6.52) with three different values of the regularization parameter:
truth (red thick line), noisy data (o) and estimate (black solid line)


6.7 Asymptotic Properties ⋆


6.7.1 The Regression Function/Optimal Predictor 


In what follows, we use $\mu$ to indicate a probability measure on the input space $\mathcal{X}$.
For simplicity, we assume that it admits a probability density function (pdf) denoted
by $p_x$. The input locations $x_i$ are now seen as random quantities and $p_x$ models
the stochastic mechanism through which they are drawn from $\mathcal{X}$. For instance, in
the system identification scenario treated in Sect. 6.6.1, each input location contains
system input values, e.g., see (6.40). If we assume that the input is a stationary
stochastic process, all the $x_i$ indeed follow the same pdf $p_x$.

Let also $\mathcal{Y}$ indicate the output space. Then, $p_{yx}$ denotes the joint pdf on
$\mathcal{X} \times \mathcal{Y}$ which factorizes into $p_{y|x}(y|x)\, p_x(x)$. Here, $p_{y|x}$
is the pdf of the output y conditional on a particular realization x.

Let us now introduce some important quantities which are functions of $\mathcal{X}$,
$\mathcal{Y}$ and $p_{yx}$. Given a function f, the least squares error associated to f is defined by

$$\mathrm{Err}(f) = \mathcal{E}(y - f(x))^2 = \int_{\mathcal{X} \times \mathcal{Y}} (y - f(x))^2\, p_{yx}(y, x)\, dx\, dy. \qquad (6.53)$$

The following result, also discussed in [33], characterizes the minimizer of Err(f)
and has connections with Theorem 4.1.


Theorem 6.19 (The regression function, based on [33]) We have

$$f_\rho = \arg\min_{f} \mathrm{Err}(f),$$

where $f_\rho$ is the so-called regression function defined by

$$f_\rho(x) = \int_{\mathcal{Y}} y\, p_{y|x}(y|x)\, dy, \quad x \in \mathcal{X}. \qquad (6.54)$$


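One can see that the regression function does not depend on the marginal density
$p_x$ but only on the conditional $p_{y|x}$. For any given x, it corresponds to the posterior
mean (Bayes estimate) of the output y conditional on x. The proof of this fact is
easily obtained by first using the following decomposition

$$\begin{aligned}
\mathrm{Err}(f) &= \int_{\mathcal{X} \times \mathcal{Y}} \left(y - f_\rho(x) + f_\rho(x) - f(x)\right)^2 p_{yx}(y, x)\, dx\, dy \\
&= \mathcal{E}(f_\rho(x) - f(x))^2 + \mathcal{E}(y - f_\rho(x))^2 \\
&\quad + 2 \int_{\mathcal{X}} (f_\rho(x) - f(x)) \underbrace{\left( \int_{\mathcal{Y}} (y - f_\rho(x))\, p_{y|x}(y|x)\, dy \right)}_{0} p_x(x)\, dx \\
&= \mathcal{E}(f_\rho(x) - f(x))^2 + \mathcal{E}(y - f_\rho(x))^2,
\end{aligned}$$

and then noticing that the very last term is independent of f.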
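
The decomposition can be verified exactly on a discrete analogue, where the pdfs are replaced by a probability mass function on finite sets. This is only a sketch under that assumption: the supports, the randomly generated joint pmf and the competitor f are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
x_vals = np.array([0.0, 1.0, 2.0])          # finite input space (assumption)
y_vals = np.array([-1.0, 0.0, 1.0, 2.0])    # finite output space (assumption)

# Hypothetical joint pmf: p_yx[i, j] = P(x = x_vals[i], y = y_vals[j])
p_yx = rng.uniform(0.1, 1.0, size=(x_vals.size, y_vals.size))
p_yx /= p_yx.sum()
p_x = p_yx.sum(axis=1)

# Regression function: conditional mean of y given x, discrete version of (6.54)
f_rho = (p_yx @ y_vals) / p_x

def err(f):
    """Err(f) = E (y - f(x))^2, with f given by its values on x_vals, cf. (6.53)."""
    resid2 = (y_vals[None, :] - f[:, None]) ** 2
    return float(np.sum(resid2 * p_yx))

f = np.array([0.5, -0.3, 1.7])               # arbitrary competitor
lhs = err(f)
rhs = float(np.sum((f_rho - f) ** 2 * p_x)) + err(f_rho)
```

Since the first term of `rhs` is nonnegative and vanishes only for f equal to `f_rho`, the conditional mean attains the minimum of the error.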

Theorem 6.19 shows that $f_\rho$ is the best output predictor in the sense that it
minimizes the expected quadratic loss (MSE) on a new output drawn from $p_{yx}$. Now,
we will consider a scenario where $p_{y|x}$ (and possibly also $p_x$) is unknown and only
N samples $\{x_i, y_i\}_{i=1}^{N}$ from $p_{yx}$ are available. We will study the asymptotic
properties (N growing to infinity) of the regularized approaches previously described. The
regularization network case is treated in the following subsection.


6.7.2 Regularization Networks: Statistical Consistency 


Consider the following regularization network

$$\hat{g}_N = \arg\min_{f \in \mathcal{H}} \frac{\sum_{i=1}^{N} (y_i - f(x_i))^2}{N} + \gamma \|f\|_{\mathcal{H}}^2, \qquad (6.55)$$

which coincides with (6.32) except for the introduction of the scale factor 1/N in
the quadratic loss. We have also stressed the dependence of the estimate on the data
set size N. Our goal is to assess whether $\hat{g}_N$ converges to $f_\rho$ as $N \to \infty$
using the norm $\| \cdot \|_{\mathcal{L}_2^\mu}$ defined by the pdf $p_x$ as follows

$$\|f\|_{\mathcal{L}_2^\mu}^2 = \int_{\mathcal{X}} f^2(x)\, p_x(x)\, dx.$$

First, details on the data generation process are provided.
Data generation assumptions The probability measure $\mu$ on $\mathcal{X}$ is assumed to be
Borel non degenerate. As already recalled, this means that realizations from $p_x$ can
cover entirely $\mathcal{X}$, without holes. This happens, e.g., when $p_x(x) > 0 \ \forall x \in \mathcal{X}$.
The stochastic processes $x_i$ and $y_i$ are jointly stationary, with joint pdf $p_{yx}$.

The study is not limited to the i.i.d. case. This is important, e.g., in system
identification where, as visible in (6.40), input locations contain past input values shifted in
time, hence introducing correlation among the $x_i$. Let a, b indicate two integers with
$a \le b$. Then, $\mathcal{M}_a^b$ denotes the $\sigma$-algebra generated by
$(x_a, y_a), \ldots, (x_b, y_b)$. The process (x, y) is said to satisfy a strong mixing condition
if there exists a sequence of real numbers $\psi_m$ such that, $\forall k, m \ge 1$, one has

$$|P(A \cap B) - P(A)P(B)| \le \psi_m \quad \forall A \in \mathcal{M}_1^k,\ B \in \mathcal{M}_{k+m}^{\infty},$$

with

$$\lim_{m \to \infty} \psi_m = 0.$$

Intuitively, if a, b represent different time instants, this means that the random
variables tend to become independent as their temporal distance increases.


Assumption 6.20 (Data generation and strong mixing condition) The probability
measure $\mu$ on the input space (having pdf $p_x$) is nondegenerate. In addition, the
random variables $x_i$ and $y_i$ form two jointly stationary stochastic processes, with
finite moments up to the third order, and satisfy a strong mixing condition. Finally,
denoting with $\psi_i$ the mixing coefficients, one has

$$\sum_{i=1}^{\infty} (\psi_i)^{1/3} < \infty.$$

Consistency Result
The following theorem, whose proof is in Sect. 6.9.6, illustrates the convergence in
probability of (6.55) to the best output predictor.


Theorem 6.21 (Statistical consistency of the regularization networks) Let $\mathcal{H}$ be a
RKHS of functions $f : \mathcal{X} \to \mathbb{R}$ induced by the Mercer kernel K, with
$\mathcal{X}$ a compact metric space. Assume that $f_\rho \in \mathcal{H}$ and that
Assumption 6.20 holds. In addition, let

$$\gamma \propto \frac{1}{N^{\alpha}}, \qquad (6.56)$$

where $\alpha$ is any scalar in $(0, \tfrac{1}{2})$. Then, as N goes to infinity, one has

$$\|\hat{g}_N - f_\rho\|_{\mathcal{L}_2^\mu} \longrightarrow_p 0, \qquad (6.57)$$

where $\longrightarrow_p$ denotes convergence in probability.
The meaning of (6.56) is the following one. The regularizer $\| \cdot \|_{\mathcal{H}}^2$ in (6.55)
restores the well-posedness of the problem by introducing some bias in the estimation
process. Intuitively, to have consistency, the amount of regularization should decay
to zero as N goes to $\infty$, but not too rapidly in order to keep the variance term under
control. This can be obtained making the regularization parameter $\gamma$ go to zero with
the rate suggested by (6.56).


6.7.3 Connection with Statistical Learning Theory 


We now discuss the class of estimators (6.21) within the framework of statistical 
learning theory. 


Learning problem Let us consider the problem of learning from examples as defined
in statistical learning. The starting point is that described in Sect. 6.7.1. There is an
unknown probabilistic relationship between the variables x and y described by the
joint pdf $p_{yx}$ on $\mathcal{X} \times \mathcal{Y}$. We are given examples
$\{x_i, y_i\}_{i=1}^{N}$ of this relationship, called training data, which are independently
drawn from $p_{yx}$. The aim of the learning process is to obtain an estimator $\hat{g}_N$
(a map from the training set to a space of functions) able to predict the output y given
any $x \in \mathcal{X}$.


Generalization and consistency In the statistical learning scenario, the two fundamental
properties of an estimator are generalization and consistency. To introduce
them, first we introduce a loss function $\mathcal{V}(y, f(x))$, called risk functional. Then,
the mean error associated to a function f is the expected risk given by

$$I(f) = \int_{\mathcal{X} \times \mathcal{Y}} \mathcal{V}(y, f(x))\, p_{yx}(y, x)\, dx\, dy. \qquad (6.58)$$

Note that, in the quadratic loss case, the expected risk coincides with the error already
introduced in (6.53). Given a function f, the empirical risk is instead defined by

$$I_N(f) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{V}(y_i, f(x_i)). \qquad (6.59)$$

Then, we introduce a class of functions forming the hypothesis space $\mathcal{F}$ where the
predictor is searched for. The ideal predictor, also called the target function, is given
by³

$$f_0 = \arg\min_{f \in \mathcal{F}} I(f). \qquad (6.60)$$

3 Here, and also when introducing empirical risk minimization (ERM), we assume that all the
introduced minimizers exist. If this does not hold true, all the concepts remain valid by resorting to
the concept of almost minimizers and almost ERM, with $I(f_0) := \inf_{f \in \mathcal{F}} I(f)$.
In general, even when a quadratic loss is chosen, $f_0$ does not coincide with the
regression function $f_\rho$ introduced in (6.54) since $\mathcal{F}$ may not contain $f_\rho$.

The concepts of generalization and consistency trace back to [97, 99-101]. Below,
recall that $\hat{g}_N$ is stochastic since it is a function of the training set which contains
the random variables $\{x_i, y_i\}_{i=1}^{N}$.


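Definition 6.3 (Generalization and consistency, based on [102]) The estimator $\hat{g}_N$
(uniformly) generalizes if $\forall \varepsilon > 0$:

$$\lim_{N \to \infty} \sup_{p_{yx}} P\left\{ |I_N(\hat{g}_N) - I(\hat{g}_N)| > \varepsilon \right\} = 0. \qquad (6.61)$$

The estimator is instead (universally) consistent if $\forall \varepsilon > 0$:

$$\lim_{N \to \infty} \sup_{p_{yx}} P\left\{ I(\hat{g}_N) > I(f_0) + \varepsilon \right\} = 0. \qquad (6.62)$$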


From (6.61), one can see that generalization implies that the performance on the
training set (the empirical error) must converge to the "true" performance on future
outputs (the expected error). The presence of the $\sup_{p_{yx}}$ indicates that this
property must hold uniformly w.r.t. all the possible stochastic mechanisms which
generate the data. Consistency, as defined in (6.62), instead requires the expected
error of $\hat{g}_N$ to converge to the expected error achieved by the best predictor in
$\mathcal{F}$. Note that the reconstruction of $f_0$ is not required. The goal is that
$\hat{g}_N$ be able to mimic the prediction performance of $f_0$ asymptotically. Key issues in
statistical learning theory are the understanding of the conditions on $\hat{g}_N$, the
function class $\mathcal{F}$ and the loss $\mathcal{V}$ which ensure such properties.


Empirical Risk Minimization
The most natural technique to determine $f_0$ from data is the empirical risk
minimization (ERM) approach where the empirical risk is optimized:

$$\hat{g}_N = \arg\min_{f \in \mathcal{F}} I_N(f) = \arg\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} \mathcal{V}(y_i, f(x_i)). \qquad (6.63)$$

The study of ERM has provided a full characterization of the necessary and sufficient
conditions for its generalization and consistency. To introduce them, we first need to
provide further details on the data generation assumptions.


Assumption 6.22 (Data generation assumptions) It holds that

• the $\{x_i, y_i\}_{i=1}^{N}$ are i.i.d. and each couple has joint pdf $p_{yx}$;
• the input space $\mathcal{X}$ is a compact set in the Euclidean space;
• $y \in \mathcal{Y}$ almost surely with $\mathcal{Y}$ a bounded real set;
• the class of functions $\mathcal{F}$ is bounded, e.g., under the sup-norm;
• $A \le \mathcal{V}(y, f(x)) \le B$, for $f \in \mathcal{F}$, $y \in \mathcal{Y}$, with A, B finite
and independent of f and y.

Note that, if the first four points hold true, in practice any loss function of interest,
such as quadratic, Huber or Vapnik, satisfies the last requirement.

We now introduce the concept of $V_\gamma$-dimension [5]. It is a complexity measure
which extends the concept of Vapnik-Chervonenkis (VC) dimension originally
introduced for the indicator functions.


Definition 6.4 ($V_\gamma$-dimension, based on [5]) Let Assumption 6.22 hold. The
$V_\gamma$-dimension of $\mathcal{V}$ in $\mathcal{F}$, i.e., of the set $\mathcal{V}(y, f(x))$,
$f \in \mathcal{F}$, is defined as the maximum number h of vectors
$(x_1, y_1), \ldots, (x_h, y_h)$ that can be separated in all $2^h$ possible ways using rules

Class 1: if $\mathcal{V}(y_i, f(x_i)) \ge s + \gamma$,
Class 0: if $\mathcal{V}(y_i, f(x_i)) \le s - \gamma$

for $f \in \mathcal{F}$ and some $s > 0$. If, for any h, it is possible to find h pairs
$(x_1, y_1), \ldots, (x_h, y_h)$ that can be separated in all the $2^h$ possible ways, the
$V_\gamma$-dimension of $\mathcal{V}$ in $\mathcal{F}$ is infinite.


So, the $V_\gamma$-dimension is infinite if, for any data set size, one can always find a
function f and a set of points which can be separated by f in any possible way.
Note that the required margin to distinguish the classes increases as $\gamma$ increases.
This means that the $V_\gamma$-dimension is a monotonically decreasing function of $\gamma$.

The following definition deals with the uniform, distribution-free convergence of 
empirical means to expectations for classes of real-valued functions. It is related to 
the so-called uniform laws of large numbers. 


Definition 6.5 (Uniform Glivenko Cantelli class, based on [5]) Let $\mathcal{G}$ denote a space
of functions from $\mathcal{Z}$ to a bounded real set, and let $p_z$ denote a generic pdf
on $\mathcal{Z}$. Then, $\mathcal{G}$ is said to be a Uniform Glivenko Cantelli (uGC) class⁴ if

$$\forall \varepsilon > 0 \quad \lim_{N \to \infty} \sup_{p_z} P\left\{ \sup_{g \in \mathcal{G}} \left| \frac{1}{N} \sum_{i=1}^{N} g(z_i) - \int_{\mathcal{Z}} g(z)\, p_z(z)\, dz \right| > \varepsilon \right\} = 0.$$


It turns out that, under the ERM framework, generalization and consistency are
equivalent concepts. Moreover, the finiteness of the $V_\gamma$-dimension coincides with the
concept of uGC class relative to the adopted losses and turns out to be the necessary and
sufficient condition for generalization and consistency [5]. This is formalized below.


Theorem 6.23 (ERM and $V_\gamma$-dimension, based on [5]) Let Assumption 6.22 hold.
The following facts are then equivalent:

• ERM (uniformly) generalizes.
• ERM is (uniformly) consistent.
• The $V_\gamma$-dimension of $\mathcal{V}$ in $\mathcal{F}$ is finite for any $\gamma > 0$.
• The class of functions $\mathcal{V}(y, f(x))$ with $f \in \mathcal{F}$ is uGC.


4 Sometimes, the class defined by (6.5) in terms of convergence in probability is called weak uGC
while almost sure convergence leads to a strong uGC. However, it can be proved that, if Assumption
6.22 holds true and the function class is the composition of the losses with $\mathcal{F}$, the two
concepts become equivalent.


In the last point regarding the uGC class, one can follow Definition 6.5 using the
correspondences $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, $z = (x, y)$, $p_z = p_{yx}$,
with [A, B] as the bounded real set containing the function values.


Connection with Regularization in RKHS

The connection between statistical learning theory and the class of kernel-based
estimators (6.21) is obtained using as function space $\mathcal{F}$ a ball $\mathcal{B}_r$ in
a RKHS $\mathcal{H}$, i.e.,

$$\mathcal{F} = \mathcal{B}_r := \left\{ f \in \mathcal{H} \mid \|f\|_{\mathcal{H}} \le r \right\}. \qquad (6.64)$$

The ERM method (6.63) becomes

$$\hat{g}_N = \arg\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \mathcal{V}(y_i, f(x_i)) \quad \text{s.t.} \quad \|f\|_{\mathcal{H}} \le r, \qquad (6.65)$$

which is an inequality constrained optimization problem. Exploiting the Lagrangian
theory, we can find a positive scalar $\gamma$, function of r and of the data set size N, which
makes (6.65) equivalent to

$$\hat{g}_N = \arg\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \mathcal{V}(y_i, f(x_i)) + \gamma \left( \|f\|_{\mathcal{H}}^2 - r^2 \right),$$

which, apart from constants, coincides with (6.21). The question now is whether
(6.65) is consistent in the sense of the statistical learning theory. The answer is
positive. In fact, under Assumption 6.22, it can be proved that the class of functions
$\mathcal{V}$ in $\mathcal{F}$ is uGC if $\mathcal{F}$ is uGC. In addition, one sufficient (but
not necessary) condition for $\mathcal{F}$ to be uGC is that $\mathcal{F}$ be a compact set in
the space of continuous functions. The following important result then holds.


Theorem 6.24 (Generalization and consistency of the kernel-based approaches,
based on [33, 65]) Let $\mathcal{H}$ be any RKHS induced by a Mercer kernel containing
functions $f : \mathcal{X} \to \mathbb{R}$, with $\mathcal{X}$ a compact metric space. Then, for
any r, the ball $\mathcal{B}_r$ is compact in the space of continuous functions equipped with
the sup-norm. It then follows that $\mathcal{B}_r$ is uGC and, if Assumption 6.22 holds, the
regularized estimator (6.65) generalizes and is consistent.


Theorem 6.24 thus shows that kernel-based approaches permit the use of flexible
infinite-dimensional models with the guarantee that the best prediction performance
(achievable inside the chosen class) will be asymptotically reached.
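
The correspondence between the radius r in (6.65) and the regularization parameter $\gamma$ can be explored numerically in the quadratic loss case. The following is a minimal sketch under stated assumptions: the Gaussian kernel, the data and the bisection routine `gamma_for_radius` are hypothetical, and the coefficients are assumed to be $\hat{c} = (K + N\gamma I_N)^{-1} Y$, the usual solution when the loss is scaled by 1/N. The squared RKHS norm of the estimate, $\hat{c}^T K \hat{c}$, decreases as $\gamma$ grows, so a given radius can be matched by a scalar search.

```python
import numpy as np

def gaussian_kernel(x, z, rho=0.1):
    """Gaussian kernel (6.43) on scalar inputs."""
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / rho)

def rkhs_norm_sq(K, Y, gamma):
    """Squared RKHS norm c^T K c of the regularization network estimate."""
    N = K.shape[0]
    c = np.linalg.solve(K + N * gamma * np.eye(N), Y)
    return float(c @ K @ c)

def gamma_for_radius(K, Y, r, lo=1e-6, hi=1e3, iters=200):
    """Hypothetical helper: log-scale bisection for the gamma giving norm r."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if rkhs_norm_sq(K, Y, mid) > r ** 2:
            lo = mid      # norm still too large: more regularization needed
        else:
            hi = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 30)
Y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
K = gaussian_kernel(x, x)

gammas = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1.0])
norms = np.array([rkhs_norm_sq(K, Y, g) for g in gammas])

# Recover the gamma associated with the radius produced by gamma = 1e-2
r = np.sqrt(norms[2])
gamma_r = gamma_for_radius(K, Y, r)
```

The search only applies to radii for which the constraint in (6.65) is active, i.e., values between the norms obtained at the two ends of the bracketing interval.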
6.8 Further Topics and Advanced Reading


Basic functional analysis principles can be found, e.g., in [59, 79, 112]. The concept 
of RKHS was developed in 1950 in the seminal works [13, 20]. Classical books on 
the subject are [6, 82, 84]. RKHSs have been introduced within the machine learning 
community in [46, 47] leading, in conjunction with Tikhonov regularization theory 
[21, 96], to the development of many powerful kernel-based algorithms [42, 86]. 

Extensions of the theory to vector-valued RKHSs are described in [62]. This is
connected to the so-called multi-task learning problem [18, 29] which deals with 
the simultaneous reconstruction of several functions. Here, the key point is that 
measurements taken on a function (task) may be informative w.r.t. the other ones, 
see [16, 40, 68, 95] for illustrations of the advantages of this approach. Multi-task 
learning will be illustrated in Chap. 9 using also a numerical example based on real 
pharmacokinetics data. 

Mercer theorem dates back to [60] which discusses also the connection with inte- 
gral equations, see also the book [50]. Extensions of the theorem to non compact 
domains are discussed in [94]. The first version of the representer theorem appears 
in [52]. It has been then subject of many generalizations which can be found in [11, 
36, 83, 103, 110]. Recent works have also extended the classical formulation to the 
context of vector-valued functions (multi-task learning and collaborative filtering), 
matrix regularization problems (with penalty given by spectral functions of matri- 
ces), matricizations of tensors, see, e.g., [1, 7, 12, 54, 87]. These different types of 
representer theorems are cast in a general framework in [10]. 

The term regularization network traces back to [71] where it is illustrated that a 
particular regularized scheme is equal to a radial basis function network. Support 
vector regression and classification were introduced in [24, 31, 37, 98], see also the 
classical book [102]. Robust statistics are described in [51]. 

The term “kernel trick” was used in [83] while interpretation of kernels as inner 
products in a feature space was first described in [4]. Sobolev spaces are illustrated, 
e.g., in [2] while classical works on smoothing splines are [32, 104]. The important 
spline interpolation properties are described in [3, 14, 22]. 

Polynomial kernels were used for the first time in [70] while an application to 
Wiener system identification can be found in [44], as also discussed later on in Chap. 8 
devoted to nonlinear system identification. An explicit (spectral) characterization of 
the RKHS induced by the Gaussian kernel can be found in [91, 92], while the more 
general case of radial basis kernels is treated in [85]. The concept of universal kernel 
is discussed, e.g., in [61, 90]. 

The strong mixing condition is discussed, e.g., in [107] and [34]. 

The convergence proof for the regularization network relies upon the integral 
operator approach described in [88] in an i.i.d. setting and its extension to the
dependent case developed in [66] in the Wiener system identification context. For 
other works on statistical consistency and learning rates of regularized least squares 
in RKHS see, e.g., [48, 93, 105, 109, 111]. 
Statistical learning theory and the concepts of generalization and consistency, in
connection with the uniform law of large numbers, date back to the works of Vapnik
and Chervonenkis [97, 99-101]. Other related works on convergence of empirical
processes are [38, 39, 73]. The concept of $V_\gamma$-dimension is introduced in [5], where
its equivalence with the Glivenko-Cantelli class is also proved, see also [41] for links with
RKHS. Relationships between the concept of stability of estimates (continuous dependence
on the data) and generalization/consistency can be found in [63, 72], see also [26]
for previous work on this subject. Numerical computation of the regularized
estimate (6.21) is discussed in the literature studying the relationship between machine
learning and convex optimization [19, 25, 77]. In the regularization network case
(quadratic loss), if the data set size N is large, plain application of a solver with
computational cost $O(N^3)$ can be highly inefficient. Then, one can use approximate
representations of the kernel function [15, 53], based, e.g., on the Nyström method
or greedy strategies [89, 106, 113]. One can also exploit the Mercer theorem by just
using an mth-order approximation of K given by $\sum_{i=1}^{m} \zeta_i \rho_i(x)\rho_i(y)$. The
solution obtained with this kernel may provide accurate approximations also when $m \ll N$,
see [28, 43, 67, 114, 115]. Training of kernel machines can also be accelerated by
using randomized low-dimensional feature spaces [74], see also [78] for insights on
learning rates.
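
As an illustration of the low-rank idea, the sketch below builds a Nyström approximation of the kernel matrix from m landmark inputs and solves the regularization network in the resulting m-dimensional feature space. It is not taken from the cited works: the Laplacian kernel, the landmark selection (every fifth input) and the helper names are hypothetical, and the penalty is taken as $\gamma \|f\|_{\mathcal{H}}^2$ without the 1/N scaling of the loss.

```python
import numpy as np

def laplacian_kernel(x, z, rho=0.5):
    """Laplacian kernel (6.44) on scalar inputs."""
    return np.exp(-np.abs(x[:, None] - z[None, :]) / rho)

def nystrom_features(x, landmarks, kernel):
    """Return Z such that Z Z^T = K_nm pinv(K_mm) K_mn (Nystrom approximation)."""
    K_mm = kernel(landmarks, landmarks)
    K_nm = kernel(x, landmarks)
    evals, evecs = np.linalg.eigh(K_mm)
    keep = evals > 1e-10 * evals.max()          # discard numerically null directions
    return K_nm @ evecs[:, keep] / np.sqrt(evals[keep])

def fit_in_feature_space(Z, Y, gamma):
    """Fitted values Z (Z^T Z + gamma I)^{-1} Z^T Y: cost O(N m^2) instead of O(N^3)."""
    m = Z.shape[1]
    w = np.linalg.solve(Z.T @ Z + gamma * np.eye(m), Z.T @ Y)
    return Z @ w

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 50)
Y = np.cos(4 * x) + 0.1 * rng.standard_normal(x.size)
gamma = 0.1

# Exact solution with the full kernel matrix
K = laplacian_kernel(x, x)
fit_exact = K @ np.linalg.solve(K + gamma * np.eye(x.size), Y)

# Approximation with m = 10 landmarks, and sanity check with all inputs as landmarks
Z_small = nystrom_features(x, x[::5], laplacian_kernel)
fit_small = fit_in_feature_space(Z_small, Y, gamma)
Z_full = nystrom_features(x, x, laplacian_kernel)
fit_full = fit_in_feature_space(Z_full, Y, gamma)
```

When all the inputs are used as landmarks the approximation is exact, which gives a simple check of the implementation; the quality obtained with fewer landmarks depends on the kernel and on the data.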

In the case of a generic convex loss (different from the quadratic), one problem is 
that the objective is not differentiable everywhere. In this circumstance, the powerful 
interior point (IP) methods [64, 108] can be employed, which apply damped Newton 
iterations to a relaxed version of the Karush-Kuhn-Tucker (KKT) equations for 
the objective [27]. A statistical and computational framework that allows their broad 
application to the problem (6.21) for a wide class of piecewise linear quadratic losses 
can be found in [8, 9]. In practice, IP methods exhibit a relatively fast convergence 
behaviour. However, as in the quadratic case, a difficulty can arise if $N$ is very large: 
it may not be possible to store the entire kernel matrix in memory, and this 
fact can hinder the application of second-order optimization techniques such as the 
(damped) Newton method. A way to circumvent this problem is given by the so-called 
decomposition methods, where a subset of the coefficients $c_i$, called the working 
set, is selected, and the associated low-dimensional sub-problem is solved. In this 
way, only the corresponding entries of the output kernel matrix need to be loaded 
into memory, e.g., see [30, 56-58]. An extreme case of decomposition method is 
coordinate descent, where the working set contains only one coefficient [35, 45, 49]. 
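
To make the coordinate descent idea concrete, here is a minimal sketch for the quadratic loss. It assumes that the expansion coefficients solve $(K + \gamma I)c = y$ (unnormalized quadratic loss; with a $1/N$-normalized loss, $\gamma$ would be replaced by $N\gamma$) and performs exact one-coefficient minimization of the equivalent quadratic $\frac{1}{2}c^T (K + \gamma I) c - y^T c$, i.e., Gauss-Seidel sweeps. The data, the Gaussian kernel and the function names are hypothetical choices for this example.

```python
import math


def gaussian_kernel(x, y):
    """Example positive definite kernel (an assumption of this sketch)."""
    return math.exp(-0.5 * (x - y) ** 2)


def coordinate_descent(K, y, gamma, sweeps=200):
    """Minimize 0.5 * c'(K + gamma*I)c - y'c one coefficient at a time.

    Each inner step is the exact minimization over a single c_i (working
    set of size one), which coincides with a Gauss-Seidel update for the
    linear system (K + gamma*I) c = y.
    """
    n = len(y)
    c = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Only row i of the kernel matrix is needed for this update.
            off_diag = sum(K[i][j] * c[j] for j in range(n) if j != i)
            c[i] = (y[i] - off_diag) / (K[i][i] + gamma)
    return c


x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [math.sin(v) for v in x]
K = [[gaussian_kernel(a, b) for b in x] for a in x]
gamma = 1.0

c = coordinate_descent(K, y, gamma)

# Residual of the optimality condition (K + gamma*I) c = y.
residual = max(
    abs(sum(K[i][j] * c[j] for j in range(len(x))) + gamma * c[i] - y[i])
    for i in range(len(x))
)
print(f"max residual after the sweeps: {residual:.2e}")
```

Since each update touches a single row of the kernel matrix, the rows can be computed or loaded on demand, which is the memory advantage mentioned above.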


6.9 Appendix 


6.9.1 Fundamentals of Functional Analysis 


We gather some basic functional analysis definitions and results. 

Vector Spaces 

We will assume that the reader is familiar with the concept of real vector space $V$ 
(the field is given by the real numbers). Here, we just recall that this is a set whose 
elements are called vectors. The space is closed w.r.t. two operations, called addition 
and scalar multiplication, which satisfy the usual algebraic properties. This means 
that any linear and finite combination of vectors still falls in $V$. When the vector space 
contains functions $g : \mathcal{X} \to \mathbb{R}$, for any $f, g \in V$ and $\alpha \in \mathbb{R}$ the two operations are 
defined as follows: 

\[ f + g = h \quad \text{where} \quad h(x) = f(x) + g(x) \quad \forall x \in \mathcal{X} \]

and 

\[ \alpha f = h \quad \text{where} \quad h(x) = \alpha f(x) \quad \forall x \in \mathcal{X}. \]


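Inner Products and Norms 

An inner product on $V$ is the function 

\[ \langle \cdot, \cdot \rangle : V \times V \to \mathbb{R} \]

which is 

1. linear in the first argument 
\[ \langle \alpha v + \beta y, z \rangle = \alpha \langle v, z \rangle + \beta \langle y, z \rangle, \quad v, y, z \in V, \ \alpha, \beta \in \mathbb{R}; \]
2. symmetric (and so also linear in the second argument) 
\[ \langle v, y \rangle = \langle y, v \rangle; \]
3. positive, in the sense that 
\[ \langle v, v \rangle \ge 0 \quad \forall v \]
with 
\[ \langle v, v \rangle = 0 \iff v = 0, \]
where in the r.h.s. $0$ denotes the null vector. 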


Recall also that a norm on $V$ is the nonnegative function 

\[ \| \cdot \| : V \to \mathbb{R}_+ \]

which satisfies 

1. absolute homogeneity 
\[ \| \alpha v \| = |\alpha| \, \| v \|, \quad v \in V, \ \alpha \in \mathbb{R}; \]
2. the triangle inequality 
\[ \| v + y \| \le \| v \| + \| y \|; \]
3. null vector condition 
\[ \| v \| = 0 \iff v = 0. \]

The norm induced by the inner product $\langle \cdot, \cdot \rangle$ is given by 

\[ \| v \|^2 = \langle v, v \rangle, \]

and it is easy to check that this function indeed satisfies all the three norm axioms 
listed above. One also has the Cauchy-Schwarz inequality 

\[ | \langle v, y \rangle | \le \| v \| \, \| y \|. \]

Finally, recall that both $\langle \cdot, x \rangle$ with $x \in V$ and $\| \cdot \|$ are examples of continuous 
functionals $V \to \mathbb{R}$, i.e., if $\lim_{j \to \infty} \| v - v_j \| = 0$, then 

\[ \lim_{j \to \infty} \| v_j \| = \| v \|, \qquad \lim_{j \to \infty} \langle v_j, x \rangle = \langle v, x \rangle \quad \forall x \in V. \]

Hilbert and Banach Spaces 

A Hilbert space $\mathcal{H}$ is a vector space equipped with an inner product $\langle \cdot, \cdot \rangle$ which is 
complete w.r.t. the norm $\| \cdot \|$ induced by such inner product. This means that, 
given any Cauchy sequence, i.e., a sequence of vectors $\{g_j\}_{j=1}^{\infty}$ such that 

\[ \lim_{m, n \to \infty} \| g_m - g_n \| = 0, \]

there exists $g \in \mathcal{H}$ such that 

\[ \lim_{j \to \infty} \| g - g_j \| = 0. \]

In other words, every Cauchy sequence is convergent. Examples of Hilbert spaces 
used in this book are 

• the classical Euclidean space $\mathbb{R}^m$ of vectors $a = [a_1 \ \ldots \ a_m]$ equipped with the 
classical Euclidean inner product 
\[ \langle a, b \rangle_2 = \sum_{i=1}^{m} a_i b_i, \]
sometimes denoted just by $\langle \cdot, \cdot \rangle$ in the book; 
• the space $\ell_2$ of squared summable real sequences $a = [a_1 \ a_2 \ \ldots]$, i.e., such that 
\[ \sum_{i=1}^{\infty} a_i^2 < \infty, \]
equipped with the inner product 
\[ \langle a, b \rangle_2 = \sum_{i=1}^{\infty} a_i b_i; \]
• the classical Lebesgue space $\mathcal{L}_2^{\mu}$ of functions (where the measure $\mu$ is here omitted 
to simplify notation) $g : \mathcal{X} \to \mathbb{R}$ which are squared summable w.r.t. the measure 
$\mu$, i.e., such that 
\[ \int_{\mathcal{X}} g^2(x) \, d\mu(x) < \infty, \]
equipped with the inner product still denoted by $\langle \cdot, \cdot \rangle_2$ but now given by 
\[ \langle g, f \rangle_2 = \int_{\mathcal{X}} g(x) f(x) \, d\mu(x). \]


The spaces reported above are also instances of metric spaces where, for every 
pair of vectors $f, g$, there is a notion of distance defined by $\| f - g \|$. Other metric 
spaces are the Banach spaces. They are normed vector spaces complete w.r.t. the 
metric induced by their norm. Hence, every Hilbert space is a Banach space but the 
converse is not true: this happens when $\| \cdot \|$ does not derive from an inner product. 
Examples of Banach spaces (whose norm does not derive from an inner product) are 

• the space $\ell_1$ of absolutely summable real sequences $a = [a_1 \ a_2 \ \ldots]$, i.e., such that 
\[ \sum_{i=1}^{\infty} |a_i| < \infty, \]
equipped with the norm 
\[ \| a \|_1 = \sum_{i=1}^{\infty} |a_i|; \]
• the Lebesgue space $\mathcal{L}_1^{\mu}$ of functions $g : \mathcal{X} \to \mathbb{R}$ absolutely integrable w.r.t. the 
measure $\mu$, i.e., such that 
\[ \int_{\mathcal{X}} |g(x)| \, d\mu(x) < \infty, \]
equipped with the norm 
\[ \| g \|_1 = \int_{\mathcal{X}} |g(x)| \, d\mu(x); \]
• the space $\ell_\infty$ of bounded real sequences $a = [a_1 \ a_2 \ \ldots]$, i.e., such that 
\[ \sup_i |a_i| < \infty, \]
equipped with the norm 
\[ \| a \|_\infty = \sup_i |a_i|; \]
• the space $\mathcal{C}$ of continuous functions $g : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is a compact set 
typically in $\mathbb{R}^m$, equipped with the sup-norm (also called uniform norm) 
\[ \| g \|_\infty = \max_{x \in \mathcal{X}} |g(x)|; \]
• the Lebesgue space $\mathcal{L}_\infty^{\mu}$ of functions $g : \mathcal{X} \to \mathbb{R}$ which are essentially bounded 
w.r.t. the measure $\mu$, i.e., for any $g$ there exists $M$ such that 
\[ |g(x)| \le M \ \text{almost everywhere in } \mathcal{X} \text{ w.r.t. the measure } \mu, \]
equipped with the norm 
\[ \| g \|_\infty = \inf \{ M \ | \ |g(x)| \le M \ \text{almost everywhere in } \mathcal{X} \text{ w.r.t. the measure } \mu \}. \]


An infinite-dimensional Hilbert (or Banach) space is said to be separable if it 
admits a countable basis $\{\rho_j\}_{j=1}^{\infty}$, i.e., for any $g$ in the space we can find scalars $c_j$ 
such that 

\[ \lim_{n \to \infty} \Big\| g - \sum_{j=1}^{n} c_j \rho_j \Big\| = 0. \]

When such vectors $\{\rho_j\}$ satisfy also the conditions 

\[ \| \rho_j \| = 1 \ \ \forall j, \qquad \langle \rho_j, \rho_i \rangle = 0 \ \ \forall i \ne j, \]

then the basis is said to be orthonormal. 


Subspaces, Projections and Compact Sets 
A subset $S$ of the vector space $V$ is said to be a subspace if $S$ is itself a vector space 
with the same addition and multiplication operations defined in $V$. The symbol 

\[ \mathrm{span}(\{\rho_j\}_{j \in A}) \]

denotes the subspace containing all the finite linear combinations of vectors taken 
from the (possibly uncountable) family $\{\rho_j\}_{j \in A}$. 

Given a subspace (or simply a set) $S$ contained in a Hilbert (or Banach) space, we 
define 

\[ \bar{S} = S \cup \{ \text{all the limits of Cauchy sequences built using vectors in } S \}. \]

Then, $S$ is said to be closed if 

\[ \bar{S} = S. \]

The orthogonal to a subspace $S$ of a Hilbert space is denoted by $S^{\perp}$ and defined by 

\[ S^{\perp} = \{ g \ | \ \langle g, f \rangle = 0 \ \ \forall f \in S \}. \]

It is easy to prove that $S^{\perp}$ is always a closed subspace. 
The following fundamental theorem holds. 
The following fundamental theorem holds. 


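Theorem 6.25 (Projection theorem) Let $S$ be a closed subspace of a Hilbert space $\mathcal{H}$ 
with norm $\| \cdot \|_{\mathcal{H}}$. Then, one has 

• any $g \in \mathcal{H}$ has a unique decomposition 
\[ g = g_S + g_{S^\perp}, \quad g_S \in S, \ g_{S^\perp} \in S^{\perp}; \]
• $g_S$ is the projection of $g$ onto $S$, i.e., 
\[ g_S = \arg\min_{f \in S} \| g - f \|_{\mathcal{H}}; \]
• it holds that 
\[ \| g \|_{\mathcal{H}}^2 = \| g_S \|_{\mathcal{H}}^2 + \| g_{S^\perp} \|_{\mathcal{H}}^2. \]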
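
The three statements of the projection theorem can be checked numerically in the Euclidean space $\mathbb{R}^3$. The sketch below is only an illustration: the subspace, the vector $g$ and the helper names are assumptions, and the projection is computed through a Gram-Schmidt orthonormalization of the spanning vectors.

```python
def dot(a, b):
    """Euclidean inner product."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis of span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project(g, vectors):
    """Split g into its component in span(vectors) and the orthogonal rest."""
    g_s = [0.0] * len(g)
    for q in orthonormal_basis(vectors):
        coeff = dot(g, q)
        g_s = [gi + coeff * qi for gi, qi in zip(g_s, q)]
    g_perp = [a - b for a, b in zip(g, g_s)]
    return g_s, g_perp


spanning = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]  # S is a plane in R^3
g = [3.0, -1.0, 2.0]
g_s, g_perp = project(g, spanning)

print("g_S      =", g_s)
print("g_S_perp =", g_perp)
print("Pythagoras gap:", dot(g, g) - dot(g_s, g_s) - dot(g_perp, g_perp))
```

Any other element of the plane is farther from $g$ than $g_S$, which is the minimization property stated in the second bullet.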
A set $A$ contained in a Hilbert (or Banach) space with norm $\| \cdot \|$ is said to be 
compact if, given any sequence $\{g_j\}$ of vectors all contained in $A$, it is possible to 
extract a subsequence $\{g_{k_j}\}$ convergent in $A$, i.e., there exists $g \in A$ such that 

\[ \lim_{j \to \infty} \| g - g_{k_j} \| = 0. \]

When the space is finite-dimensional, a set is compact iff it is closed and bounded. 

Linear and Bounded Functionals 

Given a Hilbert space $\mathcal{H}$ with norm $\| \cdot \|_{\mathcal{H}}$, a linear functional $L : \mathcal{H} \to \mathbb{R}$ is said to be 
bounded (or, equivalently, continuous) if there exists a positive scalar $C$ such that 

\[ | L[g] | \le C \| g \|_{\mathcal{H}}, \quad \forall g \in \mathcal{H}. \tag{6.66} \]


The following classical theorem holds. 


Theorem 6.26 (Closed graph theorem) Let $\mathcal{H}$ be a Hilbert (or Banach) space. Then 
$L : \mathcal{H} \to \mathbb{R}$ is linear and bounded if and only if the graph of $L$, i.e., 

\[ \mathrm{Gr}(L) = \{ (f, L[f]) \ \text{with} \ f \in \mathcal{H} \}, \]

is a closed set in the product space $\mathcal{H} \times \mathbb{R}$. This means that if $\{f_i\}_{i=1}^{\infty}$ is a sequence 
converging to $f \in \mathcal{H}$ and $\{L[f_i]\}_{i=1}^{\infty}$ converges to $y \in \mathbb{R}$, then $L[f] = y$. 

This other fundamental theorem asserts that every linear and bounded functional 
over $\mathcal{H}$ is in one-to-one correspondence with a vector in $\mathcal{H}$. 

Theorem 6.27 (Riesz representation theorem, based on [76]) Let $\mathcal{H}$ be a Hilbert 
space and let $L : \mathcal{H} \to \mathbb{R}$. Then $L$ is linear and bounded if and only if there is a 
unique $f \in \mathcal{H}$ such that 

\[ L[g] = \langle g, f \rangle_{\mathcal{H}}, \quad \forall g \in \mathcal{H}. \tag{6.67} \]
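
As a simple worked instance of the Riesz representation theorem, consider the Euclidean space $\mathbb{R}^m$ recalled above; the weights $w_i$ below are an arbitrary choice made only for illustration.

```latex
% Hilbert space: R^m with the Euclidean inner product <a, b>_2 = sum_i a_i b_i.
% Linear functional defined by arbitrary (illustrative) weights w_1, ..., w_m:
\[
  L[g] = \sum_{i=1}^{m} w_i g_i, \qquad g = [g_1 \ \ldots \ g_m] \in \mathbb{R}^m .
\]
% Its representer is the vector of weights itself:
\[
  L[g] = \langle g, w \rangle_2 , \qquad w = [w_1 \ \ldots \ w_m] .
\]
% Boundedness, i.e., (6.66), follows from the Cauchy-Schwarz inequality
% with constant C = ||w||_2:
\[
  | L[g] | = | \langle g, w \rangle_2 | \le \| w \|_2 \, \| g \|_2 .
\]
```

Uniqueness is also immediate here: if $\langle g, w \rangle_2 = \langle g, w' \rangle_2$ for all $g$, choosing $g = w - w'$ gives $\| w - w' \|_2^2 = 0$.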


6.9.2 Proof of Theorem 6.1 


First, we derive two lemmas which are instrumental to the main proof. 


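Lemma 6.1 Let 

\[ S = \mathrm{span}(\{K_x\}_{x \in \mathcal{X}}). \]

If there exists a Hilbert space $\mathcal{H}$ satisfying conditions (6.2) and (6.3), then $\mathcal{H}$ is 
the closure of $S$, i.e., $\mathcal{H} = \bar{S}$. 

Proof It comes from condition (6.2) that $\bar{S}$ is a closed subspace which must belong 
to $\mathcal{H}$. Theorem 6.25 (Projection theorem) then ensures that any function $f \in \mathcal{H}$ 
can be written as 

\[ f = f_{\bar{S}} + f_{\bar{S}^\perp}, \quad f_{\bar{S}} \in \bar{S}, \ f_{\bar{S}^\perp} \in \bar{S}^{\perp}. \]

As for the component $f_{\bar{S}^\perp}$, using condition (6.3) (reproducing property) we obtain 

\[ f_{\bar{S}^\perp}(x) = \langle f_{\bar{S}^\perp}, K_x \rangle_{\mathcal{H}} = 0, \quad \forall x. \]

In fact, every kernel section belongs to $\bar{S}$ and is thus orthogonal to every function in 
$\bar{S}^{\perp}$. Hence, $f_{\bar{S}^\perp}$ is the null vector and this concludes the proof. 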


Lemma 6.2 Let $S = \mathrm{span}(\{K_x\}_{x \in \mathcal{X}})$ and define 

\[ \| f \|_{\mathcal{H}}^2 = \sum_{i=1}^{m} \sum_{j=1}^{m} c_i c_j K(x_i, x_j), \tag{6.68} \]

where $f$ is a generic element in $S$, hence admitting representation 

\[ f(\cdot) = \sum_{i=1}^{m} c_i K_{x_i}(\cdot). \]

Then, $\| \cdot \|_{\mathcal{H}}$ is a well-defined norm in $S$. 

Proof The reader can easily check that absolute homogeneity and the triangle 
inequality are satisfied by $\| \cdot \|_{\mathcal{H}}$. We only need to prove the null vector condition, 
i.e., that for every $f \in S$ one has 

\[ \| f \|_{\mathcal{H}} = 0 \iff f = 0. \]

Now, assume that $\| f \|_{\mathcal{H}} = 0$ where $f(\cdot) = \sum_{i=1}^{m} c_i K_{x_i}(\cdot)$. While the coefficients 
$\{c_i\}_{i=1}^{m}$ and the input locations $\{x_i\}_{i=1}^{m}$ are fixed and define $f$, let also $c_{m+1}$ and 
$x_{m+1}$ be an arbitrary scalar and input location, respectively. Define $\mathbf{K} \in \mathbb{R}^{m \times m}$ and 
$\mathbf{K}_+ \in \mathbb{R}^{(m+1) \times (m+1)}$ as the two matrices with $(i, j)$-entry given by $K(x_i, x_j)$. Let also $c = 
[c_1 \ \ldots \ c_m]^T$ and $c_+ = [c_1 \ \ldots \ c_m \ c_{m+1}]^T$. Note that $\mathbf{K} c$ is the vector which contains 
the values of $f$ on the input locations $\{x_i\}_{i=1}^{m}$. 

Since $K$ is positive definite, it holds that 

\[ c_+^T \mathbf{K}_+ c_+ \ge 0 \quad \forall (c_{m+1}, x_{m+1}) \in (\mathbb{R} \times \mathcal{X}). \]

In addition, since by assumption 

\[ \| f \|_{\mathcal{H}}^2 = c^T \mathbf{K} c = 0, \]

it comes that the components of the vector $\mathbf{K} c$, which are the values of $f$ on $\{x_i\}_{i=1}^{m}$, 
are all null. Now, we show that $f(x) = 0$ holds everywhere, also on the generic input 
location $x_{m+1} \in \mathcal{X}$. In fact, after simple calculations, one obtains 

\begin{align}
c_+^T \mathbf{K}_+ c_+ &= c^T \mathbf{K} c + 2 \sum_{i=1}^{m} c_i K(x_i, x_{m+1}) c_{m+1} + K(x_{m+1}, x_{m+1}) c_{m+1}^2 \\
&= 2 \sum_{i=1}^{m} c_i K(x_i, x_{m+1}) c_{m+1} + K(x_{m+1}, x_{m+1}) c_{m+1}^2 \\
&= 2 f(x_{m+1}) c_{m+1} + K(x_{m+1}, x_{m+1}) c_{m+1}^2.
\end{align}

Now, assume that $f(x_{m+1}) > 0$. Then, since the last term on the r.h.s. is infinitesimal 
of order two w.r.t. $c_{m+1}$, we can find a negative value for $c_{m+1}$ sufficiently close to 
zero such that $c_+^T \mathbf{K}_+ c_+ < 0$, which contradicts the fact that $K$ is positive definite. 
If $f(x_{m+1}) < 0$ we can instead find a positive value for $c_{m+1}$ sufficiently close to 
zero such that $c_+^T \mathbf{K}_+ c_+ < 0$, which is still a contradiction. Hence, we must have 
$f(x_{m+1}) = 0$. Since $x_{m+1}$ was arbitrary, we conclude that $f$ must be the null function. 

We now prove Theorem 6.1. Let $S = \mathrm{span}(\{K_x\}_{x \in \mathcal{X}})$ and, for any $f, g \in S$ having 
representations 

\[ f(\cdot) = \sum_{i=1}^{m} c_i K_{x_i}(\cdot), \qquad g(\cdot) = \sum_{i=1}^{p} d_i K_{y_i}(\cdot), \]

define 

\[ \langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^{m} \sum_{j=1}^{p} c_i d_j K(x_i, y_j). \]

By Lemma 6.2, it is immediate to check that $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is a well-defined inner product 
on $S$. Then, we now show that the desired Hilbert space is $\mathcal{H} = \bar{S}$, where $\bar{S}$ is the 
completion of $S$ w.r.t. the norm induced by $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. 

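Condition (6.2) is trivially satisfied since, by construction, all the kernel sections 
belong to $\mathcal{H}$. 

As for the condition (6.3), we start checking that it holds over $S$. Introducing the 
pair of functions in $S$ given by 

\[ f(\cdot) = \sum_{i=1}^{m} c_i K_{x_i}(\cdot), \qquad g(\cdot) = K_x(\cdot), \]

we have 

\[ \langle f, K_x \rangle_{\mathcal{H}} = \langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^{m} c_i K(x_i, x) = f(x), \]

showing that the reproducing property holds in $S$. Let us now consider the completion 
of $S$. To this aim, let $\{f_j\}$ be a Cauchy sequence with $f_j \in S \ \forall j$. We have 

\begin{align}
| f_j(x) - f_i(x) | &= | \langle f_j - f_i, K_x \rangle_{\mathcal{H}} | \\
&\le \| f_j - f_i \|_{\mathcal{H}} \, \| K_x \|_{\mathcal{H}},
\end{align}

where we have used first the reproducing property (since it holds in $S$) and then the 
Cauchy-Schwarz inequality. We have 

\[ \| K_x \|_{\mathcal{H}} = \sqrt{\langle K_x, K_x \rangle_{\mathcal{H}}} = \sqrt{K(x, x)} \le q < +\infty, \]

where the scalar $q$ independent of $x$ exists because the kernel $K$ is continuous over 
the compact $\mathcal{X} \times \mathcal{X}$. Combining the last two inequalities leads to 

\[ | f_j(x) - f_i(x) | \le \sup_{x \in \mathcal{X}} | f_j(x) - f_i(x) | \le q \| f_j - f_i \|_{\mathcal{H}}, \tag{6.69} \]

which shows that the convergence in $\mathcal{H}$ implies also uniform convergence. In other 
words, if $f_j \to f$ in $\mathcal{H}$ w.r.t. $\| \cdot \|_{\mathcal{H}}$, then $f_j \to f$ also in the space $\mathcal{C}$ of continuous 
functions w.r.t. the sup-norm $\| \cdot \|_\infty$. Since $S \subset \mathcal{C}$ and $\mathcal{C}$ is Banach, all the functions 
in the completion of $S$ are continuous, i.e., $\mathcal{H} \subset \mathcal{C}$. Furthermore, if $f_j \to f$ in $\mathcal{H}$, 
one has that for any $x \in \mathcal{X}$ 

\[ \lim_{j \to \infty} \langle f_j, K_x \rangle_{\mathcal{H}} = \langle f, K_x \rangle_{\mathcal{H}}, \]

by the continuity of the inner product. But we also have 

\[ \lim_{j \to \infty} \langle f_j, K_x \rangle_{\mathcal{H}} = \lim_{j \to \infty} f_j(x) = f(x), \]

since $f_j \in S \ \forall j$, the reproducing property holds in $S$ and convergence in $\mathcal{H}$ 
implies uniform (and, hence, pointwise) convergence. This shows that $\langle f, K_x \rangle_{\mathcal{H}} = 
f(x) \ \forall f \in \mathcal{H}$, i.e., the reproducing property holds over all the space $\mathcal{H}$. 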

The last point is the uniqueness of $\mathcal{H}$. For the sake of contradiction, assume that there 
exists another Hilbert space $\mathcal{G}$ which satisfies conditions (6.2) and (6.3). By Lemma 
6.1, we must have $\mathcal{G} = \bar{S}$ where the completion of $S$ is w.r.t. the norm $\| \cdot \|_{\mathcal{G}}$ deriving 
from the inner product $\langle \cdot, \cdot \rangle_{\mathcal{G}}$. Condition (6.3) holds both in $\mathcal{H}$ and in $\mathcal{G}$, so that we 
have 

\[ \langle K_x, K_s \rangle_{\mathcal{H}} = K(x, s) = \langle K_x, K_s \rangle_{\mathcal{G}}, \quad \forall (x, s) \in \mathcal{X} \times \mathcal{X}. \]

Since the functions in $S$ are finite linear combinations of kernel sections, by the 
linearity of the inner product, the above equality allows us to conclude that 

\[ \langle f, g \rangle_{\mathcal{H}} = \langle f, g \rangle_{\mathcal{G}}, \quad \forall (f, g) \in S \times S. \]

Such an equality, together with the uniqueness of limits, implies that the completion 
of $S$ w.r.t. $\| \cdot \|_{\mathcal{H}}$ coincides with the completion w.r.t. $\| \cdot \|_{\mathcal{G}}$. Hence, $\mathcal{H}$ and $\mathcal{G}$ are 
the same Hilbert space and this completes the proof. 
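
The inner product used in the proof is easy to evaluate for finite kernel expansions. The following sketch, which assumes a Gaussian kernel and arbitrary coefficients chosen only for illustration, checks numerically that the reproducing property holds on $S$ and that $|f(x)| \le \sqrt{K(x, x)} \, \| f \|_{\mathcal{H}}$, the bound behind (6.69).

```python
import math


def kernel(x, y):
    """Gaussian kernel, used only as an example of a positive definite K."""
    return math.exp(-0.5 * (x - y) ** 2)


def inner_product(c, xs, d, ys):
    """<f, g>_H for f = sum_i c_i K_{x_i} and g = sum_j d_j K_{y_j}."""
    return sum(
        ci * dj * kernel(xi, yj)
        for ci, xi in zip(c, xs)
        for dj, yj in zip(d, ys)
    )


def evaluate(c, xs, x):
    """Pointwise value f(x) = sum_i c_i K(x_i, x)."""
    return sum(ci * kernel(xi, x) for ci, xi in zip(c, xs))


# Hypothetical element of S: coefficients c_i and input locations x_i.
c, xs = [1.0, -2.0, 0.5], [0.0, 1.0, 3.0]
x0 = 0.7

reproduced = inner_product(c, xs, [1.0], [x0])  # <f, K_{x0}>_H
direct = evaluate(c, xs, x0)                    # f(x0)
squared_norm = inner_product(c, xs, c, xs)      # c' K c, as in (6.68)
sup_bound = math.sqrt(kernel(x0, x0)) * math.sqrt(squared_norm)

print("f(x0) via inner product:", reproduced)
print("f(x0) evaluated directly:", direct)
print("|f(x0)| <= sqrt(K(x0, x0)) * ||f||_H:", abs(direct) <= sup_bound)
```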


6.9.3 Proof of Theorem 6.10 


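It is not difficult to see that (6.12) with the inner product (6.13) is a Hilbert space. In 
addition, using the Mercer theorem, in particular the expansion (6.11), from (6.13) 
one has 

\begin{align}
\| K_x \|_{\mathcal{H}}^2 &= \Big\| \sum_{i \in \mathcal{I}} \zeta_i \rho_i(x) \rho_i(\cdot) \Big\|_{\mathcal{H}}^2 \\
&= \sum_{i \in \mathcal{I}} \frac{\zeta_i^2 \rho_i^2(x)}{\zeta_i} = K(x, x) < \infty
\end{align}

and, for any $f = \sum_{i \in \mathcal{I}} a_i \rho_i$, it also holds that 

\begin{align}
\langle K_x, f \rangle_{\mathcal{H}} &= \Big\langle \sum_{i \in \mathcal{I}} \zeta_i \rho_i(x) \rho_i(\cdot), \ \sum_{i \in \mathcal{I}} a_i \rho_i(\cdot) \Big\rangle_{\mathcal{H}} \\
&= \sum_{i \in \mathcal{I}} \frac{\zeta_i \rho_i(x) a_i}{\zeta_i} = f(x).
\end{align}

This shows that every kernel section belongs to $\mathcal{H}$ and the reproducing property 
holds. Theorem 6.1 then ensures that $\mathcal{H}$ is indeed the RKHS associated to $K$. 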


6.9.4 Proof of Theorem 6.13 


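First, let $\mathcal{H}$ be the RKHS induced by $K(x, y) = \zeta \rho(x) \rho(y)$. Any RKHS is spanned 
by its kernel sections, hence in this case $\mathcal{H}$ is the one-dimensional subspace generated 
by $\rho$. By the reproducing property it holds that 

\[ \| K_x \|_{\mathcal{H}}^2 = K(x, x) = \zeta \rho^2(x). \]

In addition, one has 

\[ \| K_x \|_{\mathcal{H}}^2 = \| \zeta \rho(x) \rho(\cdot) \|_{\mathcal{H}}^2 = \zeta^2 \rho^2(x) \| \rho \|_{\mathcal{H}}^2, \]

so that 

\[ \| \rho \|_{\mathcal{H}}^2 = \frac{1}{\zeta}. \]

Now, consider the kernel of interest $K(x, y) = \sum_{i=1}^{\infty} \zeta_i \rho_i(x) \rho_i(y)$ associated with 
$\mathcal{H}$. Define $K_j(x, y) = \zeta_j \rho_j(x) \rho_j(y)$ and let $\| \cdot \|_{\mathcal{H}_j}$ denote the norm induced by 
$K_j$. From the discussion above it holds that 

\[ \| \rho_j \|_{\mathcal{H}_j}^2 = \frac{1}{\zeta_j}. \tag{6.70} \]

Think of $K(x, y) = \sum_{i=1}^{\infty} \zeta_i \rho_i(x) \rho_i(y)$ as the sum of $K_j(x, y)$ and $K_{-j}(x, y) = 
\sum_{k \ne j} \zeta_k \rho_k(x) \rho_k(y)$. Then, using Theorem 6.6 and (6.70), one has 

\[ \| \rho_j \|_{\mathcal{H}}^2 = \min_{c_j, h} \ \frac{c_j^2}{\zeta_j} + \| h \|_{\mathcal{H}_{-j}}^2 \quad \text{s.t.} \quad \rho_j = c_j \rho_j + h, \ c_j \in \mathbb{R}, \ h \in \mathcal{H}_{-j}, \]

where $\mathcal{H}_{-j}$ is the RKHS induced by $K_{-j}$. Evaluating the objective at $(c_j = 1, h = 0)$, 
one obtains 

\[ \| \rho_j \|_{\mathcal{H}}^2 \le \frac{1}{\zeta_j} < \infty \]

and this shows that $\rho_j \in \mathcal{H} \ \forall j$. 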
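Now we prove that the functions $\rho_j$ generate the whole RKHS $\mathcal{H}$ induced by $K$. 
Using Theorem 6.25 (Projection theorem), it comes that for any $f \in \mathcal{H}$ we have 

\[ f = g + h \quad \text{with} \quad g \in G, \ h \in G^{\perp}, \]

where $G$ indicates the closure in $\mathcal{H}$ of the subspace generated by all the $\rho_k$. Using 
the reproducing property, one obtains 

\begin{align}
h(x) &= \langle h(\cdot), K(x, \cdot) \rangle_{\mathcal{H}} \\
&= \Big\langle h(\cdot), \sum_{k=1}^{\infty} \zeta_k \rho_k(x) \rho_k(\cdot) \Big\rangle_{\mathcal{H}} \\
&= \sum_{k=1}^{\infty} \zeta_k \rho_k(x) \langle h(\cdot), \rho_k(\cdot) \rangle_{\mathcal{H}} = 0 \quad \forall x,
\end{align}

where the last equality exploits the relation $h \perp \rho_k \ \forall k$. This completes the first part 
of the proof. 

As for the RKHS norm characterization, first let $\mathcal{H}_{\ge j}$ be the RKHS induced by the 
kernel $\sum_{k=j}^{\infty} K_k$, with $h_j$ to denote a generic element of $\mathcal{H}_{\ge j}$. Then, given $f \in \mathcal{H}$, 
using Theorem 6.6 in an iterative fashion, we obtain 

\begin{align}
\| f \|_{\mathcal{H}}^2 &= \min_{c_1, h_2} \ \frac{c_1^2}{\zeta_1} + \| h_2 \|_{\mathcal{H}_{\ge 2}}^2 \quad \text{s.t.} \quad f = c_1 \rho_1 + h_2 \\
&= \min_{c_1, c_2, h_3} \ \frac{c_1^2}{\zeta_1} + \frac{c_2^2}{\zeta_2} + \| h_3 \|_{\mathcal{H}_{\ge 3}}^2 \quad \text{s.t.} \quad f = c_1 \rho_1 + c_2 \rho_2 + h_3 \\
&\ \ \vdots \\
&= \min_{c_1, \ldots, c_{n-1}, h_n} \ \sum_{k=1}^{n-1} \frac{c_k^2}{\zeta_k} + \| h_n \|_{\mathcal{H}_{\ge n}}^2 \quad \text{s.t.} \quad f = \sum_{k=1}^{n-1} c_k \rho_k + h_n.
\end{align}

In particular, every equality above is obtained thinking of the kernel $\sum_{k=j}^{\infty} K_k$ as 
the sum of $K_j$ and $\sum_{k=j+1}^{\infty} K_k$. Then, $h_j$ can be decomposed into two parts, i.e., 
$h_j = c_j \rho_j + h_{j+1}$, with $\| \rho_j \|_{\mathcal{H}_j}^2 = 1/\zeta_j$ where, as before, $\| \cdot \|_{\mathcal{H}_j}$ denotes the norm in 
the one-dimensional RKHS induced by $K_j$. Now, let $\hat{c}_1, \ldots, \hat{c}_{n-1}, \hat{h}_n$ be the minimizers 
of the last objective (the minimizer can be assumed unique without loss of generality, 
just to simplify the exposition) and note that $\| \hat{h}_n \|_{\mathcal{H}}$ must go to zero as $n \to \infty$. 
Then, it comes that the sequence $\hat{c}_1, \hat{c}_2, \ldots$ characterizing the norm $\| f \|_{\mathcal{H}}$ is 
indeed $\arg\min_{\{c_k\}} \sum_{k=1}^{\infty} c_k^2 / \zeta_k$ with the $\{c_k\}$ subject to the constraint 
$\lim_{n \to \infty} \| f - \sum_{k=1}^{n} c_k \rho_k \|_{\mathcal{H}} = 0$. 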


6.9.5 Proofs of Theorems 6.15 and 6.16 


We prove the following more general result that embraces as special cases Theorems 
6.15 and 6.16. 


Theorem 6.28 Let $\mathcal{H}$ be a Hilbert space. Consider the optimization problem 

\[ \min_{f \in \mathcal{H}} \ \Phi\big( L_1[f], \ldots, L_N[f], \| f \|_{\mathcal{H}} \big) \tag{6.71} \]

and assume that 

• problem (6.71) admits at least one solution; 
• each $L_i : \mathcal{H} \to \mathbb{R}$ is linear and bounded; 
• the objective $\Phi$ is strictly increasing w.r.t. its last argument. 

Then, all the solutions of (6.71) admit the following expression 

\[ \hat{g} = \sum_{i=1}^{N} c_i \eta_i, \tag{6.72} \]

where the $c_i$ are suitable scalar expansion coefficients and each $\eta_i \in \mathcal{H}$ is the 
representer of $L_i$, i.e., 

\[ L_i[f] = \langle f, \eta_i \rangle_{\mathcal{H}}, \quad \forall f \in \mathcal{H}, \quad i = 1, \ldots, N. \]

In particular, if $\mathcal{H}$ is a RKHS with kernel $K$, each basis function is given by 

\[ \eta_i(x) = L_i[K(\cdot, x)]. \]


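To prove the above result, let $\hat{g}$ be a solution of (6.71) and denote with $S$ the 
(closed) subspace spanned by the $N$ representers $\eta_i$ of the functionals $L_i$, i.e., 

\[ S = \mathrm{span}\{\eta_1, \ldots, \eta_N\}. \]

Exploiting Theorem 6.25 (Projection theorem), we can write 

\[ \hat{g} = \hat{g}_S + \hat{g}_{S^\perp}, \quad \hat{g}_S \in S, \ \hat{g}_{S^\perp} \in S^{\perp}. \]

For the sake of contradiction, assume that $\hat{g}_{S^\perp}$ is different from the null function. 
Then, we have 

\begin{align}
\Phi\big( L_1[\hat{g}], \ldots, L_N[\hat{g}], \| \hat{g} \|_{\mathcal{H}} \big) &= \Phi\big( \langle \eta_1, \hat{g} \rangle_{\mathcal{H}}, \ldots, \langle \eta_N, \hat{g} \rangle_{\mathcal{H}}, \| \hat{g} \|_{\mathcal{H}} \big) \\
&= \Phi\big( \langle \eta_1, \hat{g}_S + \hat{g}_{S^\perp} \rangle_{\mathcal{H}}, \ldots, \langle \eta_N, \hat{g}_S + \hat{g}_{S^\perp} \rangle_{\mathcal{H}}, \| \hat{g}_S + \hat{g}_{S^\perp} \|_{\mathcal{H}} \big) \\
&= \Phi\Big( \langle \eta_1, \hat{g}_S \rangle_{\mathcal{H}}, \ldots, \langle \eta_N, \hat{g}_S \rangle_{\mathcal{H}}, \sqrt{\| \hat{g}_S \|_{\mathcal{H}}^2 + \| \hat{g}_{S^\perp} \|_{\mathcal{H}}^2} \Big) \\
&> \Phi\big( \langle \eta_1, \hat{g}_S \rangle_{\mathcal{H}}, \ldots, \langle \eta_N, \hat{g}_S \rangle_{\mathcal{H}}, \| \hat{g}_S \|_{\mathcal{H}} \big),
\end{align}

where the last equality exploits the fact that each $\eta_i$ is orthogonal to all the functions 
in $S^{\perp}$ while the inequality exploits the assumption that $\Phi$ is strictly increasing w.r.t. 
its last argument. This contradicts the optimality of $\hat{g}$ and implies that $\hat{g}_{S^\perp}$ must be 
the null function, hence concluding the first part of the proof. 

Finally, to prove the last statement of Theorem 6.28 note that, if $\mathcal{H}$ is a RKHS, one has 

\[ \eta_i(x) = \langle \eta_i, K_x \rangle_{\mathcal{H}} = L_i[K(\cdot, x)], \]

where the first equality comes from the reproducing property, while the second one 
derives from the fact that $\eta_i$ is the representer of $L_i$. 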


6.9.6 Proof of Theorem 6.21 


Preliminary Lemmas 
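The first lemma, whose proof can be found in [34], states a bound on the correlation 
between two random variables assuming values in a Hilbert space. 

Lemma 6.3 (based on [34]) Let $a$ and $b$ be zero-mean random variables measurable 
with respect to the $\sigma$-algebras $\mathcal{M}_1$ and $\mathcal{M}_2$ and with values in the Hilbert space $\mathcal{H}$ 
having inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. Then, it holds that 

\[ | \mathcal{E} \langle a, b \rangle_{\mathcal{H}} | \le 15 \, \psi(\mathcal{M}_1, \mathcal{M}_2)^{1/3} \big( \mathcal{E} \| a \|_{\mathcal{H}}^3 \big)^{1/3} \big( \mathcal{E} \| b \|_{\mathcal{H}}^3 \big)^{1/3}, \tag{6.73} \]

where all the expectations above are assumed to exist and 

\[ \psi(\mathcal{M}_1, \mathcal{M}_2) = \sup_{A \in \mathcal{M}_1, \, B \in \mathcal{M}_2} | P(A \cap B) - P(A) P(B) |. \]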


As for the second lemma, first it is useful to introduce the following integral 
operator: 

\[ L_K[f](\cdot) = \int_{\mathcal{X}} K(\cdot, x) f(x) \, d\mu(x). \]

Since the assumptions underlying Theorem 6.9 (Mercer theorem) hold true, there 
exists a complete orthonormal basis of $\mathcal{L}_2^{\mu}$ denoted by $\{\rho_i\}_{i \in \mathcal{I}}$, which satisfies 

\[ L_K[\rho_i] = \zeta_i \rho_i, \quad i \in \mathcal{I}, \quad \zeta_1 \ge \zeta_2 \ge \cdots. \]

To simplify exposition, hereby we assume $\zeta_i > 0 \ \forall i$. Then, for $r > 0$ and 
$f = \sum_{i \in \mathcal{I}} c_i \rho_i$, we define the operators $L_K^{-r}$ and $L_K^{r}$ as follows 

\[ L_K^{-r}[f] = \sum_{i \in \mathcal{I}} \frac{c_i}{\zeta_i^r} \rho_i, \tag{6.74} \]

\[ L_K^{r}[f] = \sum_{i \in \mathcal{I}} \zeta_i^r c_i \rho_i. \tag{6.75} \]

The function $L_K^{-r}[f]$ is less regular than $f$ since its expansion coefficients go to zero 
more slowly. Instead, $L_K^{r}$ is a smoothing operator since $\zeta_i^r c_i$ goes to zero faster than 
$c_i$ as $i$ goes to infinity. When $r = 1/2$ we recover the operator $L_K^{1/2}$ already defined 
in (6.17) which satisfies $\mathcal{H} = L_K^{1/2} \mathcal{L}_2^{\mu}$. The following lemma holds. 


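Lemma 6.4 If $L_K^{-r} f_\rho \in \mathcal{L}_2^{\mu}$ for some $0 < r \le 1$, letting 

\[ \hat{f} = \arg\min_{f \in \mathcal{H}} \ \| f - f_\rho \|_{\mathcal{L}_2^{\mu}}^2 + \gamma \| f \|_{\mathcal{H}}^2, \tag{6.76} \]

one has 

\[ \| \hat{f} - f_\rho \|_{\mathcal{L}_2^{\mu}} \le \gamma^r \| L_K^{-r} f_\rho \|_{\mathcal{L}_2^{\mu}}. \tag{6.77} \]

Proof By assumption, there exists $g \in \mathcal{L}_2^{\mu}$, say $g = \sum_{i \in \mathcal{I}} d_i \rho_i$, such that $f_\rho = L_K^{r} g$ 
so that $f_\rho = \sum_{i \in \mathcal{I}} \zeta_i^r d_i \rho_i$. Now, we characterize the solution $\hat{f}$ of (6.76) using 
$f = \sum_{i \in \mathcal{I}} c_i \rho_i$ and optimizing w.r.t. the $c_i$. The objective becomes 

\[ \sum_{i \in \mathcal{I}} (c_i - \zeta_i^r d_i)^2 + \gamma \sum_{i \in \mathcal{I}} \frac{c_i^2}{\zeta_i} \]

and setting the partial derivatives w.r.t. each $c_i$ to zero, we obtain 

\[ \hat{f} = \sum_{i \in \mathcal{I}} \hat{c}_i \rho_i, \qquad \hat{c}_i = \frac{\zeta_i^{r+1} d_i}{\zeta_i + \gamma}. \tag{6.78} \]

This implies 

\[ \hat{f} - f_\rho = - \sum_{i \in \mathcal{I}} \frac{\gamma \zeta_i^r}{\zeta_i + \gamma} d_i \rho_i. \]

If $0 < r \le 1$, it follows that 

\begin{align}
\| \hat{f} - f_\rho \|_{\mathcal{L}_2^{\mu}} &= \Big( \sum_{i \in \mathcal{I}} \Big( \frac{\gamma \zeta_i^r}{\zeta_i + \gamma} \Big)^2 d_i^2 \Big)^{1/2} \\
&= \gamma^r \Big( \sum_{i \in \mathcal{I}} \Big( \frac{\gamma}{\zeta_i + \gamma} \Big)^{2(1-r)} \Big( \frac{\zeta_i}{\zeta_i + \gamma} \Big)^{2r} d_i^2 \Big)^{1/2} \\
&\le \gamma^r \Big( \sum_{i \in \mathcal{I}} d_i^2 \Big)^{1/2} = \gamma^r \| g \|_{\mathcal{L}_2^{\mu}} = \gamma^r \| L_K^{-r} f_\rho \|_{\mathcal{L}_2^{\mu}}
\end{align}

and this proves (6.77). 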
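
The bound (6.77) can be checked numerically by working directly with expansion coefficients in the eigenbasis. In the sketch below, the eigenvalues $\zeta_i$, the coefficients $d_i$ of $g$ and the truncation to 200 terms are hypothetical choices; the coefficients of $\hat{f}$ are those obtained by minimizing the objective term by term.

```python
import math


def approximation_error(zeta, d, r, gamma):
    """Return (||f_hat - f_rho||, gamma^r * ||g||) computed in the eigenbasis.

    f_rho has coefficients zeta_i^r * d_i, while the minimizer f_hat of
    sum_i (c_i - zeta_i^r d_i)^2 + gamma * sum_i c_i^2 / zeta_i has
    coefficients zeta_i^(r+1) * d_i / (zeta_i + gamma).
    """
    err_sq = 0.0
    for z, di in zip(zeta, d):
        f_rho_i = z ** r * di
        f_hat_i = z ** (r + 1) * di / (z + gamma)
        err_sq += (f_hat_i - f_rho_i) ** 2
    g_norm = math.sqrt(sum(di ** 2 for di in d))
    return math.sqrt(err_sq), gamma ** r * g_norm


zeta = [1.0 / (i * i) for i in range(1, 201)]  # hypothetical eigenvalues
d = [1.0 / i for i in range(1, 201)]           # hypothetical coefficients of g

results = []
for r in (0.25, 0.5, 1.0):
    for gamma in (1e-3, 1e-2, 1e-1):
        err, bound = approximation_error(zeta, d, r, gamma)
        results.append((r, gamma, err, bound))
        print(f"r={r:<4} gamma={gamma:<6} error={err:.4e} bound={bound:.4e}")
```

In every row the error stays below the bound, and both shrink as $\gamma$ decreases, at a rate governed by the smoothness index $r$.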


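In the proof of the third lemma reported below, the notation $S_x : \mathcal{H} \to \mathbb{R}^N$ indicates 
the sampling operator defined by $S_x f = [f(x_1) \ \ldots \ f(x_N)]^T$. In addition, $S_x^T$ 
denotes its adjoint, i.e., for any $c \in \mathbb{R}^N$, it satisfies 

\[ \langle f, S_x^T c \rangle_{\mathcal{H}} = \langle S_x f, c \rangle = \sum_{i=1}^{N} c_i f(x_i) = \Big\langle f, \sum_{i=1}^{N} c_i K_{x_i} \Big\rangle_{\mathcal{H}}, \]

where $\langle \cdot, \cdot \rangle$ is the Euclidean inner product. Hence, one has 

\[ S_x^T c = \sum_{i=1}^{N} c_i K_{x_i}, \quad \forall c \in \mathbb{R}^N. \]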


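Lemma 6.5 Define 

\[ \eta_i(\cdot) = \big( y_i - \hat{f}(x_i) \big) K_{x_i}(\cdot) \tag{6.79} \]

with $\hat{f}$ defined by (6.76). Then, if $\hat{g}_N$ is given by (6.55), one has 

\[ \| \hat{g}_N - \hat{f} \|_{\mathcal{H}} \le \frac{1}{\gamma} \Big\| \frac{1}{N} \sum_{i=1}^{N} \big( \eta_i - \mathcal{E}[\eta_i] \big) \Big\|_{\mathcal{H}}. \]

Proof First, it is useful to derive two equalities involving $\hat{f}$ and $\hat{g}_N$. The first 
one is 

\[ \gamma \hat{f} = L_K[f_\rho - \hat{f}] = \mathcal{E}[\eta_i]. \tag{6.80} \]

The last equality in (6.80) follows from the definition of $L_K$ and $\eta_i$. The first equality 
can be obtained using the representation $f_\rho = \sum_{i \in \mathcal{I}} d_i \rho_i$, then following the same 
passages contained in the first part of the previous lemma's proof to obtain 

\[ \hat{f} = \sum_{i \in \mathcal{I}} \frac{\zeta_i d_i}{\zeta_i + \gamma} \rho_i, \qquad f_\rho - \hat{f} = \sum_{i \in \mathcal{I}} \frac{\gamma d_i}{\zeta_i + \gamma} \rho_i. \]

The second result consists of the following alternative expression for $\hat{g}_N$: 

\[ \hat{g}_N = \Big( \frac{S_x^T S_x}{N} + \gamma I \Big)^{-1} \frac{S_x^T Y}{N}, \tag{6.81} \]

where $I$ denotes the identity operator. To prove it, we will use the equality $\hat{g}_N = 
S_x^T (\mathbf{K} + N \gamma I_N)^{-1} Y$ which derives from the representer theorem and also the fact 
that, for any vector $c \in \mathbb{R}^N$, it holds that $S_x S_x^T c = \mathbf{K} c$ with $\mathbf{K}$ the kernel matrix built 
using $[x_1 \ \ldots \ x_N]$. Then, we have 

\begin{align}
\Big( \frac{S_x^T S_x}{N} + \gamma I \Big) \hat{g}_N &= \frac{S_x^T}{N} \Big( \mathbf{K} (\mathbf{K} + N \gamma I_N)^{-1} + N \gamma (\mathbf{K} + N \gamma I_N)^{-1} \Big) Y \\
&= \frac{S_x^T Y}{N}.
\end{align}

Now, it is also useful to obtain a bound on the inverse of the operator $\frac{S_x^T S_x}{N} + \gamma I$. 
Assume that $v \in \mathcal{H}$ and let $u$ satisfy 

\[ \Big( \frac{S_x^T S_x}{N} + \gamma I \Big) u = v. \]

We take inner products on both sides with $u$ and use the equality $\langle S_x^T S_x u, u \rangle_{\mathcal{H}} = 
\langle S_x u, S_x u \rangle$ to obtain 

\[ \frac{1}{N} \langle S_x u, S_x u \rangle + \gamma \| u \|_{\mathcal{H}}^2 = \langle v, u \rangle_{\mathcal{H}} \le \| v \|_{\mathcal{H}} \| u \|_{\mathcal{H}}. \]

One has 

\[ \lambda_x = \inf_{f \in \mathcal{H}} \frac{\| S_x f \|}{\sqrt{N} \, \| f \|_{\mathcal{H}}} \implies (\lambda_x^2 + \gamma) \| u \|_{\mathcal{H}}^2 \le \| v \|_{\mathcal{H}} \| u \|_{\mathcal{H}}. \]

Thus, we have shown that 

\[ \Big( \frac{S_x^T S_x}{N} + \gamma I \Big) u = v \implies \| u \|_{\mathcal{H}} \le \frac{\| v \|_{\mathcal{H}}}{\lambda_x^2 + \gamma} \le \frac{1}{\gamma} \| v \|_{\mathcal{H}}, \quad \forall v \in \mathcal{H}. \tag{6.82} \]

Now, it comes from (6.81) that 

\[ \hat{g}_N - \hat{f} = \Big( \frac{S_x^T S_x}{N} + \gamma I \Big)^{-1} \Big( \frac{S_x^T Y}{N} - \frac{S_x^T S_x \hat{f}}{N} - \gamma \hat{f} \Big). \]

Exploiting the equalities 

\[ \frac{S_x^T Y}{N} - \frac{S_x^T S_x \hat{f}}{N} = \frac{1}{N} \sum_{i=1}^{N} \eta_i, \qquad \gamma \hat{f} = \mathcal{E}[\eta_i], \]

which derive from (6.79) and (6.80), respectively, we obtain 

\[ \hat{g}_N - \hat{f} = \Big( \frac{S_x^T S_x}{N} + \gamma I \Big)^{-1} \Big( \frac{1}{N} \sum_{i=1}^{N} \big( \eta_i - \mathcal{E}[\eta_i] \big) \Big). \]

The use of (6.82) then completes the proof. 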


Proof of Statistical Consistency 
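Let $\hat{f}$ be defined by (6.76), i.e., 

\[ \hat{f} = \arg\min_{f \in \mathcal{H}} \ \| f - f_\rho \|_{\mathcal{L}_2^{\mu}}^2 + \gamma \| f \|_{\mathcal{H}}^2. \]

Then, consider the following error decomposition 

\[ \| \hat{g}_N - f_\rho \|_{\mathcal{L}_2^{\mu}} \le \| \hat{f} - f_\rho \|_{\mathcal{L}_2^{\mu}} + \| \hat{g}_N - \hat{f} \|_{\mathcal{L}_2^{\mu}}. \tag{6.83} \]

The first term $\| \hat{f} - f_\rho \|_{\mathcal{L}_2^{\mu}}$ on the r.h.s. is not stochastic. The assumption $f_\rho \in \mathcal{H}$ 
ensures that $\| L_K^{-r} f_\rho \|_{\mathcal{L}_2^{\mu}} < \infty$ for $0 < r \le 1/2$. It thus comes from Lemma 6.4 that, 
at least for $0 < r \le 1/2$, it holds that 

\[ \| \hat{f} - f_\rho \|_{\mathcal{L}_2^{\mu}} \le \gamma^r \| L_K^{-r} f_\rho \|_{\mathcal{L}_2^{\mu}} < \infty. \tag{6.84} \]

Now, consider the second term $\| \hat{g}_N - \hat{f} \|_{\mathcal{L}_2^{\mu}}$. Since the input space (the function 
domain) is compact, and recalling also (6.69), there exists a constant $A$ such that 

\[ \| \hat{g}_N - \hat{f} \|_{\mathcal{L}_2^{\mu}} \le A \| \hat{g}_N - \hat{f} \|_{\mathcal{H}}. \tag{6.85} \]

To obtain a bound for the r.h.s. involving the RKHS norm, consider the stochastic 
function 

\[ \eta_i(\cdot) = \big( y_i - \hat{f}(x_i) \big) K(x_i, \cdot), \]

already introduced in (6.79). Using the reproducing property, one has 

\[ \| \eta_i \|_{\mathcal{H}}^2 = \big( y_i - \hat{f}(x_i) \big)^2 K(x_i, x_i). \tag{6.86} \]

The function $\hat{f}$ belongs to $\mathcal{H}$ and is thus continuous on the compact $\mathcal{X}$. In 
addition, the kernel $K$ is continuous on the compact $\mathcal{X} \times \mathcal{X}$ and the process $\{x_i, y_i\}$ 
has finite moments up to the third order by assumption. Hence, there exists a constant 
$B$ independent of $i$ such that 

\[ \mathcal{E} \big[ \| \eta_i \|_{\mathcal{H}}^k \big] \le B, \quad k = 1, 2, 3. \tag{6.87} \]

We can now come back to $\| \hat{g}_N - \hat{f} \|_{\mathcal{H}}$. From Lemma 6.5, $\forall \gamma > 0$ it holds that 

\[ \| \hat{g}_N - \hat{f} \|_{\mathcal{H}} \le \frac{1}{\gamma} \Big\| \frac{1}{N} \sum_{i=1}^{N} \big( \eta_i - \mathcal{E}[\eta_i] \big) \Big\|_{\mathcal{H}}. \tag{6.88} \]

Now, using first Jensen's inequality and then (6.87), (6.88), Assumption 6.20 and 
(6.73) in Lemma 6.3 (with $a$ and $b$ replaced by $\eta_i - \mathcal{E}[\eta_i]$ and $\eta_j - \mathcal{E}[\eta_j]$) one 
obtains constants $C$ and $D$ such that 

\begin{align}
\Big( \mathcal{E} \Big\| \frac{1}{N} \sum_{i=1}^{N} \big( \eta_i - \mathcal{E}[\eta_i] \big) \Big\|_{\mathcal{H}} \Big)^2 &\le \mathcal{E} \Big\| \frac{1}{N} \sum_{i=1}^{N} \big( \eta_i - \mathcal{E}[\eta_i] \big) \Big\|_{\mathcal{H}}^2 \\
&\le \frac{15}{N^2} \sum_{i, j} \psi_{|i - j|}^{1/3} \big( \mathcal{E} \| \eta - \mathcal{E}[\eta] \|_{\mathcal{H}}^3 \big)^{2/3} \\
&\le \frac{C}{N} \big( \mathcal{E} \| \eta \|_{\mathcal{H}}^3 + \| \mathcal{E}[\eta] \|_{\mathcal{H}}^3 \big)^{2/3} \le \frac{D}{N},
\end{align}

where $\psi_{|i-j|}$ denotes the mixing coefficient between the $\sigma$-algebras associated with 
$\eta_i$ and $\eta_j$, and $\eta$ replaces $\eta_i$ or $\eta_j$ when the expectation is independent of $i$ and $j$. This latter 
result, combined with (6.85) and (6.88), leads to the existence of a constant $E$ such 
that 

\[ \mathcal{E} \| \hat{g}_N - \hat{f} \|_{\mathcal{L}_2^{\mu}} \le A \, \mathcal{E} \| \hat{g}_N - \hat{f} \|_{\mathcal{H}} \le \frac{E}{\gamma \sqrt{N}} \tag{6.89} \]

that, combined with (6.83) and (6.84), implies that for any $0 < r \le 1/2$ 

\[ \mathcal{E} \| \hat{g}_N - f_\rho \|_{\mathcal{L}_2^{\mu}} \le \gamma^r \| L_K^{-r} f_\rho \|_{\mathcal{L}_2^{\mu}} + \frac{E}{\gamma \sqrt{N}}. \tag{6.90} \]

Hence, when $\gamma$ is chosen according to (6.56), $\mathcal{E} \| \hat{g}_N - f_\rho \|_{\mathcal{L}_2^{\mu}}$ converges to zero as 
$N$ grows to $\infty$. Using the Markov inequality, (6.57) is finally obtained. 

Chapter 7 
Regularization in Reproducing Kernel Hilbert Spaces for Linear System Identification 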

Abstract In the previous parts of the book, we have studied how to handle linear sys- 
tem identification by using regularized least squares (ReLS) with finite-dimensional 
structures given, e.g., by finite impulse response (FIR) models. In this chapter, we 
cast this approach in the RKHS framework developed in the previous chapter. We 
show that ReLS with quadratic penalties can be reformulated as a function estima- 
tion problem in the finite-dimensional RKHS induced by the regularization matrix. 
This leads to a new paradigm for linear system identification that provides also new 
insights and regularization tools to handle infinite-dimensional problems, involving, 
e.g., IIR and continuous-time models. For all this class of problems, we will see that 
the representer theorem ensures that the regularized impulse response is a linear and 
finite combination of basis functions given by the convolution between the system 
input and the kernel sections. We then consider the issue of kernel estimation and 
introduce several tuning methods that have close connections with those related to 
the regularization matrix discussed in Chap. 3. Finally, we introduce the notion of 
stable kernels, that induce RKHSs containing only absolutely summable impulse 
responses and study minimax properties of regularized impulse response estimation. 


7.1 Regularized Linear System Identification 
in Reproducing Kernel Hilbert Spaces 


7.1.1 Discrete-Time Case 


We will consider linear discrete-time systems in the form of the so-called output 
error (OE) models. Data are generated according to the relationship 

\[ y(t) = G^0(q) u(t) + e(t), \quad t = 1, \ldots, N, \tag{7.1} \]

where $y(t)$, $u(t)$ and $e(t) \in \mathbb{R}$ are the system output, the known system input and the 
noise at time instant $t \in \mathbb{N}$, respectively. In addition, $G^0(q)$ is the “true” system that 
has to be identified from the input-output samples with $q$ being the time shift operator, 
i.e., $q u(t) = u(t + 1)$. Here, and also in all the remaining parts of the chapter, we 
assume that $e$ is white noise (all its components are mutually uncorrelated). 

In Chap. 2, we have seen that there exist different ways to parametrize $G^0(q)$. In 
what follows, we will start our discussions exploiting the simplest impulse response 
descriptions given by FIR models and then we will consider more general 
infinite-dimensional models also in continuous time. We will see that there is a common way 
to estimate them through regularization in the RKHS framework and the representer 
theorem. 


7.1.1.1 FIR Case 


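The FIR case corresponds to 

\begin{align}
y(t) &= G(q, \theta) u(t) + e(t) \\
&= \sum_{k=1}^{m} g_k u(t - k) + e(t), \quad \theta = [g_1 \ \ldots \ g_m]^T, \tag{7.2}
\end{align}

where $m$ is the FIR order, $g_1, \ldots, g_m$ are the FIR coefficients and $\theta$ is the unknown 
vector that collects them. Model (7.2) can be rewritten in vector form as follows: 

\[ Y = \Phi \theta + E, \tag{7.3} \]

where 

\[ Y = [y(1) \ \ldots \ y(N)]^T, \quad E = [e(1) \ \ldots \ e(N)]^T \]

and 

\[ \Phi = [\varphi(1) \ \ldots \ \varphi(N)]^T \]

with 

\[ \varphi^T(t) = [u(t - 1) \ \ldots \ u(t - m)]. \]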


Instead of describing FIR model estimation directly in the regularized RKHS 
framework, let us first recall the ReLS method with quadratic penalty term introduced 
in Chap. 3. It gives the estimate of $\theta$ by solving the following problem: 

\begin{align}
\hat{\theta} &= \arg\min_{\theta} \ \sum_{t=1}^{N} \Big( y(t) - \sum_{k=1}^{m} g_k u(t - k) \Big)^2 + \gamma \theta^T P^{-1} \theta \tag{7.4a} \\
&= \arg\min_{\theta} \ \| Y - \Phi \theta \|^2 + \gamma \theta^T P^{-1} \theta \tag{7.4b} \\
&= (\Phi^T \Phi + \gamma P^{-1})^{-1} \Phi^T Y \tag{7.4c} \\
&= P \Phi^T (\Phi P \Phi^T + \gamma I_N)^{-1} Y, \tag{7.4d}
\end{align}

where the regularization matrix $P \in \mathbb{R}^{m \times m}$ is positive semidefinite, assumed invertible 
for simplicity. The regularization parameter $\gamma$ is a positive scalar that, as already 
seen, has to balance adherence to experimental data and strength of regularization. 

Now we show that (7.4) can be reformulated as a function estimation problem 
with regularization in the RKHS framework. For this aim, we will see that the key 
is to use the $m \times m$ matrix $P$ to define the kernel over the domain $\{1, 2, \ldots, m\} \times 
\{1, 2, \ldots, m\}$. This in turn will define a RKHS of functions $g : \{1, 2, \ldots, m\} \to \mathbb{R}$. 
Such functions are connected with the components $g_i$ of the $m$-dimensional vector 
$\theta$ by the relation $g(i) = g_i$. So, the functional view is obtained replacing the vector 
$\theta$ with the function that maps $i$ into the $i$th component of $\theta$. 

Let us define a positive semidefinite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ as follows: 

\[ K(i, j) = P_{ij}, \quad i, j \in \mathcal{X} = \{1, 2, \ldots, m\}, \tag{7.5} \]

where $P_{ij}$ is the $(i, j)$th entry of the regularization matrix $P$. It is obvious that $K$ 
is positive semidefinite because $P$ is positive semidefinite. Its kernel sections will 
be denoted by $K_i$ with $i = 1, \ldots, m$ and are the columns of $P$ seen as functions 
mapping $\mathcal{X}$ into $\mathbb{R}$. 

Now, using the Moore-Aronszajn Theorem, illustrated in Theorem 6.2, the kernel 
$K$ reported in (7.5) defines a unique RKHS $\mathcal{H}$ such that $\langle K_i, g \rangle_{\mathcal{H}} = g(i), \ \forall (i, g) \in 
(\mathcal{X}, \mathcal{H})$. This is the function space where we will search for the estimate of the FIR 
coefficients. According to the discussion following Theorem 6.2, since there are just 
$m$ kernel sections $K_i$ associated to the $m$ columns of $P$, for any impulse response 
candidate $g \in \mathcal{H}$, there exist $m$ scalars $a_j$ such that 

\[ g(i) = \sum_{j=1}^{m} a_j K(i, j) = P(i, :) a, \tag{7.6} \]

where $a = [a_1 \ \ldots \ a_m]^T$ and $P(i, :)$ is the $i$th row of $P$. Since $g(i)$ is the $i$th component of $\theta$, one has 
$\theta = P a$. 

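By the reproducing property, we also have

$$ \|g\|_{\mathcal{H}}^2 = \Big\langle \sum_{j=1}^{m} a_j K_j, \sum_{l=1}^{m} a_l K_l \Big\rangle_{\mathcal{H}} = \sum_{j=1}^{m} \sum_{l=1}^{m} a_j a_l K(j, l) = a^T P a $$

and, since $\theta = Pa$, this implies

$$ \|g\|_{\mathcal{H}}^2 = \theta^T P^{-1} \theta. $$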


As a result, the ReLS method (7.4) can be reformulated as follows:

$$ \hat{g} = \arg\min_{g \in \mathcal{H}} \sum_{t=1}^{N} \Big( y(t) - \sum_{k=1}^{m} g(k) u(t-k) \Big)^2 + \gamma \|g\|_{\mathcal{H}}^2, \tag{7.7} $$

which is a regularized function estimation problem in the RKHS $\mathcal{H}$.

In view of the equivalence between (7.4) and (7.7), the FIR function estimate
$\hat{g}$ has the closed-form expression given by (7.4d). The correspondence is established
by $\hat{g}(i) = \hat{\theta}_i$. We will show later that such closed-form expression can be
derived/interpreted by exploiting the representer theorem.


Remark 7.1 ⋆ Besides (7.7), there is also an alternative way to reformulate the
ReLS method (7.4) as a function estimation problem with regularization in the RKHS
framework. This has been sketched in the discussions on linear kernels in Sect. 6.6.1.
The difference lies in the choice of the function to be estimated and the choice of the
corresponding kernel. In particular, in this chapter, we have obtained (7.7) choosing
the function and the corresponding kernel to be the FIR $g$ and (7.5), respectively. In
contrast, in Sect. 6.6.1, the RKHS is defined by the kernel

$$ K(x, y) = x^T P y, \qquad x, y \in \mathcal{X} = \mathbb{R}^m, \tag{7.8} $$

and contains the linear functions $x^T \theta$, where the input locations $x$ encapsulate $m$
past input values. So, using (7.8), the corresponding RKHS does not contain impulse
responses but functions that directly represent linear systems mapping regressors
(built with input values) into outputs.


7.1.1.2 IIR Case 


The infinite impulse response (IIR) case corresponds to

$$ y(t) = G(q, \theta) u(t) + e(t) = \sum_{k=1}^{\infty} g_k u(t-k) + e(t), \qquad t = 1, \ldots, N, \tag{7.9} $$

where $\theta = [g_1, \ldots, g_\infty]^T$. So, the model order $m$ is set to $\infty$ and we have to handle
infinite-dimensional objects. To cope with the intrinsic ill-posedness of the estimation
problem, one could think of introducing an infinite-dimensional regularization matrix
$P$. But the penalty $\theta^T P^{-1} \theta$, adopted in (7.4) for the FIR case, would turn out to be
undefined. So, the RKHS setting is needed to define regularized IIR estimates. The
first step is to choose a positive semidefinite kernel $K : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$. Then, let $\mathcal{H}$
be the RKHS associated with $K$ and $g \in \mathcal{H}$ be the IIR function with $g(k) = g_k$ for
$k \in \mathbb{N}$. Finally, the estimate is given by

$$ \hat{g} = \arg\min_{g \in \mathcal{H}} \sum_{t=1}^{N} \Big( y(t) - \sum_{k=1}^{\infty} g(k) u(t-k) \Big)^2 + \gamma \|g\|_{\mathcal{H}}^2. \tag{7.10} $$


One may wonder whether it is possible to obtain a closed-form expression of the IIR
estimate $\hat{g}$ as in the FIR case. The answer is positive and is given by the following
representer theorem. It derives from Theorem 6.16, reported in the previous chapter,
applied to the case of quadratic loss functions, as discussed in Example 6.17. This
makes it possible to recover the expansion coefficients of the estimate by just solving
a linear system of equations, see (6.29) and (6.31). Before stating the result formally,
it is useful to point out the following two facts:

• in the dynamic systems context treated in this chapter any functional $L_i$ present in
Theorem 6.16 is now applied to discrete-time impulse responses $g$ that live in
the RKHS $\mathcal{H}$. Hence, it represents the discrete-time convolution with the input,
i.e., $L_i$ maps $g \in \mathcal{H}$ into the system output evaluated at the time instant $t = i$;
• from the discussion after Theorem 6.16, recall also that a linear functional $L$ is
bounded in $\mathcal{H}$ if and only if the function $f$, defined for any $x$ by
$f(x) = L[K(x, \cdot)]$, belongs to $\mathcal{H}$. Hence, the condition (7.11) reported below is
equivalent to assuming that the system input defines linear and bounded functionals
over the RKHS induced by $K$.


Theorem 7.1 (Representer theorem for discrete-time linear system identification,
based on [73, 90]) Consider the function estimation problem (7.10). Assume that
$\mathcal{H}$ is the RKHS induced by a positive semidefinite kernel $K : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ and that,
for $t = 1, \ldots, N$, the functions $\eta_t$ defined by

$$ \eta_t(i) = \sum_{k=1}^{\infty} K(i, k) u(t-k), \qquad i \in \mathbb{N}, \tag{7.11} $$

are all well defined in $\mathcal{H}$. Then, the solution of (7.10) is

$$ \hat{g}(i) = \sum_{t=1}^{N} \hat{c}_t \eta_t(i), \qquad i \in \mathbb{N}, \tag{7.12} $$

where $\hat{c}_t$ is the $t$th entry of the vector

$$ \hat{c} = (O + \gamma I_N)^{-1} Y \tag{7.13} $$

with $Y = [y(1), \ldots, y(N)]^T$ and with the $(t, s)$th entry of $O$ given by

$$ O_{ts} = \sum_{i=1}^{\infty} \sum_{k=1}^{\infty} K(i, k) u(t-k) u(s-i), \qquad t, s = 1, \ldots, N. \tag{7.14} $$


Theorem 7.1 discloses an important feature of regularized impulse response
estimation in RKHS. The function estimate $\hat{g}$ has a finite-dimensional representation
that does not depend on the dimension of the RKHS $\mathcal{H}$ induced by the kernel but
only on the data set size $N$.


Example 7.2 (Stable spline kernel for IIR estimation) To estimate high-order FIR
models, in the previous chapters, we have introduced some regularization matrices
related to the DC, TC and stable spline kernels, see (5.40) and (5.41). Consider now
the TC kernel, also called first-order stable spline, with support extended to $\mathbb{N} \times \mathbb{N}$,
i.e.,

$$ K(i, j) = \alpha^{\max(i, j)}, \qquad 0 < \alpha < 1, \quad (i, j) \in \mathbb{N}. \tag{7.15} $$

This kernel induces a RKHS that contains IIR models and can be conveniently
adopted in the estimator (7.10). An interesting question is to derive the structure of
the induced regularizer $\|g\|_{\mathcal{H}}^2$. One could connect $K$ with the matrix $P$ entering
(7.4a) but its inverse is undefined since now $P$ is infinite dimensional. To derive the
stable spline norm, it is instead necessary to resort to functional analysis arguments.
In particular, in Sect. 7.7.1, it is proved that

$$ \|g\|_{\mathcal{H}}^2 = \sum_{t=1}^{\infty} \frac{(g_{t+1} - g_t)^2}{(1 - \alpha)\alpha^t}, \tag{7.16} $$

an expression that clearly reveals how the kernel (7.15) includes information on smooth
exponential decay. When used in (7.10), the resulting IIR estimate balances the data
fit (sum of squared residuals) and the energy of the impulse response increments
weighted by coefficients that increase exponentially with time $t$ and thus enforce
stability.

Let us now consider a simple application of the representer theorem. Assume
that the system input is a causal step of unit amplitude, i.e., $u(t) = 1$ for $t \geq 0$ and
$u(t) = 0$ otherwise. The functions (7.11) are given by

$$ \eta_t(i) = \sum_{k=1}^{\infty} K(i, k) u(t-k), \qquad i \in \mathbb{N}. $$

For instance, the first three basis functions are

$$
\begin{aligned}
\eta_1(i) &= \sum_{k=1}^{\infty} K(i, k) u(1-k) = \alpha^{\max(i,1)}, \\
\eta_2(i) &= \sum_{k=1}^{\infty} K(i, k) u(2-k) = \alpha^{\max(i,1)} + \alpha^{\max(i,2)}, \\
\eta_3(i) &= \sum_{k=1}^{\infty} K(i, k) u(3-k) = \alpha^{\max(i,1)} + \alpha^{\max(i,2)} + \alpha^{\max(i,3)},
\end{aligned}
$$

and, in general, one has

$$ \eta_t(i) = \sum_{k=1}^{t} \alpha^{\max(i,k)}. $$

Hence, any $\eta_t$ is a well-defined function in the RKHS induced by $K$, being the sum
of the first $t$ kernel sections. Then, according to Theorem 7.1, we conclude that the
IIR estimate returned by (7.10) is spanned by the functions $\{\eta_t\}_{t=1}^{N}$ with coefficients
then computable from (7.13).
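
For the unit step input, the sums in (7.11) and (7.14) are finite, so the representer theorem can be evaluated exactly. The sketch below is one possible implementation under that assumption; the function names `eta_step` and `iir_estimate_step` are hypothetical, and the output data used in the usage lines are synthetic.

```python
import numpy as np

def eta_step(t, i, alpha):
    """eta_t(i) = sum_{k=1}^{t} alpha**max(i, k), valid for the unit step input."""
    k = np.arange(1, t + 1)
    return float(np.sum(alpha ** np.maximum(i, k)))

def iir_estimate_step(Y, alpha, gamma, n_coeff):
    """Evaluate (7.12)-(7.14) for the TC kernel (7.15) and a unit step input.

    Returns the first n_coeff coefficients of the estimate, the vector c_hat
    and the matrix O.
    """
    Y = np.asarray(Y, dtype=float)
    N = len(Y)
    n = max(N, n_coeff)
    # Eta[t-1, i-1] = eta_t(i)
    Eta = np.array([[eta_step(t, i, alpha) for i in range(1, n + 1)]
                    for t in range(1, N + 1)])
    # With a step input, u(s - i) = 1 only for i <= s, so
    # O_{ts} = sum_{i=1}^{s} eta_t(i): a cumulative sum along i.
    O = np.cumsum(Eta[:, :N], axis=1)
    c_hat = np.linalg.solve(O + gamma * np.eye(N), Y)
    g_hat = Eta[:, :n_coeff].T @ c_hat
    return g_hat, c_hat, O

# Usage with synthetic outputs (arbitrary values, for illustration only).
rng = np.random.default_rng(0)
Y_demo = np.cumsum(0.5 ** np.arange(10)) + 0.05 * rng.standard_normal(10)
g_demo, c_demo, O_demo = iir_estimate_step(Y_demo, alpha=0.8, gamma=0.1, n_coeff=30)
print(g_demo[:5])
```

Although only $N$ coefficients $\hat{c}_t$ are computed, the estimate can be evaluated at any index $i$, including $i > N$.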


Although Theorem 7.1 is stated for the IIR case (7.10), the same result also holds
for the FIR case (7.7). The only difference is that the series in (7.11) and (7.14)
have to be replaced by finite sums up to the FIR order $m$. Then, interestingly, one
can interpret the regularized FIR estimate (7.4d) in a different way exploiting the
representer theorem perspective. In particular, one finds $O = \Phi P \Phi^T$ while the basis
functions $\{\eta_t\}_{t=1}^{N}$ are in one-to-one correspondence with the $N$ columns of $P\Phi^T$,
each of dimension $m$.


7.1.2 Continuous-Time Case 


Now, we consider linear continuous-time systems still focusing on the output error
(OE) model structure. The system outputs are collected over $N$ time instants $t_i$.
Hence, the measurements model is

$$ y(t_i) = \int_0^{\infty} g^0(\tau) u(t_i - \tau)\, d\tau + e(t_i), \qquad i = 1, \ldots, N, \tag{7.17} $$

where $y(t)$, $u(t)$ and $e(t)$ are the system output, the known input and the noise at
time instant $t \in \mathbb{R}^+$, respectively, while $g^0(t)$, $t \in \mathbb{R}^+$, is the "true" system impulse
response.

Similarly to what was done in the previous section, we will study how to determine
from a finite set of input-output data a regularized estimate of the impulse response $g^0$
in the RKHS framework. The first step is to choose a positive semidefinite kernel
$K : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$. It induces the RKHS $\mathcal{H}$ containing the impulse response candidates
$g \in \mathcal{H}$. Then, the linear model can be estimated by solving the following function
estimation problem:

$$ \hat{g} = \arg\min_{g \in \mathcal{H}} \sum_{i=1}^{N} \Big( y(t_i) - \int_0^{\infty} g(\tau) u(t_i - \tau)\, d\tau \Big)^2 + \gamma \|g\|_{\mathcal{H}}^2. \tag{7.18} $$

The closed-form expression of the impulse response estimate $\hat{g}$ is given by the
following representer theorem that again derives from Theorem 6.16 and the same
discussion reported before Theorem 7.1. Note just that now any functional $L_i$ entering
Theorem 6.16 is applied to continuous-time impulse responses $g$ in the RKHS $\mathcal{H}$.
Hence, it represents the continuous-time convolution with the input, i.e., $L_i$ maps
$g \in \mathcal{H}$ into the system output evaluated at the time instant $t_i$.


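Theorem 7.3 (Representer theorem for continuous-time linear system identification,
based on [73, 90]) Consider the function estimation problem (7.18). Assume
that $\mathcal{H}$ is the RKHS induced by a positive semidefinite kernel $K : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$
and that, for $i = 1, \ldots, N$, the functions $\eta_i$ defined by

$$ \eta_i(s) = \int_0^{\infty} K(s, \tau) u(t_i - \tau)\, d\tau, \qquad s \in \mathbb{R}^+, \tag{7.19} $$

are all well defined in $\mathcal{H}$. Then, the solution of (7.18) is

$$ \hat{g}(s) = \sum_{i=1}^{N} \hat{c}_i \eta_i(s), \qquad s \in \mathbb{R}^+, \tag{7.20} $$

where $\hat{c}_i$ is the $i$th entry of the vector

$$ \hat{c} = (O + \gamma I_N)^{-1} Y \tag{7.21} $$

with $Y = [y(t_1), \ldots, y(t_N)]^T$ and the $(i, j)$th entry of $O$ given by

$$ O_{ij} = \int_0^{\infty} \int_0^{\infty} K(\tau, s) u(t_i - s) u(t_j - \tau)\, ds\, d\tau, \qquad i, j = 1, \ldots, N. \tag{7.22} $$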


Example 7.4 (Stable spline kernel for continuous-time system identification) In
Example 6.5, we introduced the first-order spline kernel $\min(x, y)$ on $[0, 1] \times [0, 1]$.
It describes a RKHS of continuous functions $f$ on the unit interval that satisfy
$f(0) = 0$ and whose squared norm is the energy of the first-order derivative, i.e.,

$$ \int_0^1 \big( \dot{f}(x) \big)^2 dx. \tag{7.23} $$

To describe stable impulse responses $g$, we instead need a kernel defined over the
positive real axis $\mathbb{R}^+$ that induces the constraint $g(+\infty) = 0$. A simple way to obtain
this is to exploit the composition of the spline kernel with an exponential change of
coordinates mapping $\mathbb{R}^+$ into $[0, 1]$. The resulting kernel is called (continuous-time)
first-order stable spline kernel. It is given by

$$ K(s, t) = \min(e^{-\beta s}, e^{-\beta t}) = e^{-\beta \max(s, t)}, \qquad (s, t) \in \mathbb{R}^+ \times \mathbb{R}^+, \tag{7.24} $$

where $\beta > 0$ regulates the change of coordinates and, hence, the decay rate of the
impulse responses. So, $\beta$ can be seen as a kernel parameter related to the dominant pole of
the system.

Fig. 7.1 First-order (top left) and second-order (bottom left) stable spline kernel with some kernel
sections (right panels) obtained with $\beta = 0.5$ and centred on 0, 0.5, 1, ..., 10 (bottom)

It is interesting to note the similarity between the kernel (7.15) and the first-order
stable spline kernel (7.24). By letting $\alpha = \exp(-\beta)$, the sampled version of the
first-order stable spline kernel (7.24) corresponds exactly to the TC kernel (7.15). The top
panels of Fig. 7.1 plot (7.24) and also some kernel sections: they are all continuous
and exponentially decaying to zero. This kernel also inherits the universality property
of the splines. In fact, its kernel sections can approximate any continuous impulse
response on all the compact subsets of $\mathbb{R}^+$.

The relationship with splines also makes it easy to obtain one spectral decomposition
of (7.24). In particular, in Example 6.11, we obtained the following expansion
of the spline kernel:

$$ \min(x, y) = \sum_{i=1}^{+\infty} \zeta_i \rho_i(x) \rho_i(y) $$

with

$$ \rho_i(x) = \sqrt{2} \sin\Big( i\pi x - \frac{\pi x}{2} \Big), \qquad \zeta_i = \frac{1}{(i\pi - \pi/2)^2}, $$

where all the $\rho_i$ are mutually orthogonal on $[0, 1]$ w.r.t. the Lebesgue measure. In
view of the simple connection between spline and stable spline kernels given by
exponential time transformations, one easily obtains that the first-order stable spline
kernel can be diagonalized as follows:

$$ e^{-\beta \max(s, t)} = \sum_{i=1}^{+\infty} \zeta_i \phi_i(s) \phi_i(t) \tag{7.25} $$

with

$$ \phi_i(t) = \rho_i(e^{-\beta t}), \qquad \zeta_i = \frac{1}{(i\pi - \pi/2)^2}, \tag{7.26} $$


where the $\phi_i$ are now orthogonal on $[0, +\infty)$ w.r.t. the measure $\mu$ of density $\beta e^{-\beta t}$.
In Fig. 6.3, we reported the eigenfunctions $\rho_i$ with $i = 1, 2, 8$ and the eigenvalues $\zeta_i$
for the first-order spline kernel (6.47). For comparison, we now show in Fig. 7.2 the
corresponding eigenfunctions $\phi_i$ of the first-order stable spline kernel (7.24) with
$\beta = 1$ and also the $\zeta_i$. While the eigenvalues are the same, differently from the $\rho_i$
the eigenfunctions $\phi_i$ now decay exponentially to zero.
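
The expansion (7.25)-(7.26) can be checked numerically by truncating the series. The sketch below does this for a single pair $(s, t)$; the function name `stable_spline_mercer` and the truncation length are illustrative choices, not part of the text. Since the eigenvalues decay only as $1/i^2$, many terms are needed for a tight match.

```python
import numpy as np

def stable_spline_mercer(s, t, beta, n_terms):
    """Truncated version of the expansion (7.25) with phi_i and zeta_i from (7.26)."""
    i = np.arange(1, n_terms + 1)
    freq = i * np.pi - np.pi / 2.0            # i*pi - pi/2
    zeta = 1.0 / freq**2                      # eigenvalues zeta_i
    # phi_i(t) = rho_i(exp(-beta t)) with rho_i(x) = sqrt(2) sin(freq * x)
    phi_s = np.sqrt(2.0) * np.sin(freq * np.exp(-beta * s))
    phi_t = np.sqrt(2.0) * np.sin(freq * np.exp(-beta * t))
    return float(np.sum(zeta * phi_s * phi_t))

# Compare the truncated series with the closed form exp(-beta * max(s, t)).
s_demo, t_demo, beta_demo = 0.7, 1.3, 1.0
approx = stable_spline_mercer(s_demo, t_demo, beta_demo, n_terms=20000)
exact = np.exp(-beta_demo * max(s_demo, t_demo))
print(abs(approx - exact))
```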

Fig. 7.2 Expansion of the continuous-time first-order stable spline kernel $e^{-\beta \max(x, y)}$ with $\beta = 1$:
eigenfunctions $\rho_i(x)$ for $i = 1, 2, 8$ (left panel) and eigenvalues $\zeta_i$ (right)

Having obtained one spectral decomposition of (7.24), we can now exploit
Theorem 6.10 to obtain the following representation of the RKHS induced by the
first-order stable spline kernel:

$$ \mathcal{H} = \Big\{ g \;\Big|\; g(t) = \sum_{i=1}^{\infty} c_i \phi_i(t), \; t \geq 0, \; \sum_{i=1}^{\infty} \frac{c_i^2}{\zeta_i} < \infty \Big\}, \tag{7.27} $$

and the squared norm of $g$ turns out to be

$$ \|g\|_{\mathcal{H}}^2 = \sum_{i=1}^{\infty} \frac{c_i^2}{\zeta_i}. \tag{7.28} $$

Now we will exploit the above results to obtain a more useful expression for $\|g\|_{\mathcal{H}}^2$.
The deep connection between spline and stable spline kernel implies that these two
spaces are isometrically isomorphic, i.e., there is a one-to-one correspondence that
preserves inner products. In fact, we can associate to any stable spline function $g(t)$ in
$\mathcal{H}$ the spline function $f(t)$ in the space induced by (6.47) such that $g(t) = f(e^{-\beta t})$.
So, $g(t) = \sum_{i=1}^{\infty} c_i \phi_i(t)$ implies $f(t) = \sum_{i=1}^{\infty} c_i \rho_i(t)$ and the two functions have
indeed the same norm $\sum_{i=1}^{\infty} c_i^2 / \zeta_i$. Now, using (7.23) and (7.28), we obtain

$$ \|g\|_{\mathcal{H}}^2 = \int_0^1 \big( \dot{f}(t) \big)^2 dt = \int_0^{+\infty} \big( \dot{g}(t) \big)^2 \frac{e^{\beta t}}{\beta}\, dt. \tag{7.29} $$
0 0 B 


This expression gives insights into the nature of the stable spline space. Compared
to the classical Sobolev space induced by the first-order spline kernel, the norm
penalizes the energy of the first-order derivative of $g$ with a weight proportional
to $e^{\beta t}$. Such a norm thus forces all the functions in $\mathcal{H}$ to be continuous impulse
responses decaying to zero at least exponentially. Note also that (7.29) is
the natural continuous-time counterpart of the norm (7.16) associated with the discrete-time
stable spline kernel.
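
The second equality in (7.29) is a change of variables. The following worked steps spell it out using only the relation $g(t) = f(e^{-\beta t})$; the symbol $x$ for the spline argument is introduced here for convenience.

```latex
\begin{aligned}
x &= e^{-\beta t}, \qquad dx = -\beta e^{-\beta t}\, dt, \\
\dot{g}(t) &= \dot{f}(x)\, \frac{dx}{dt} = -\beta e^{-\beta t}\, \dot{f}(x)
\quad \Longrightarrow \quad \dot{f}(x) = -\frac{e^{\beta t}}{\beta}\, \dot{g}(t), \\
\int_0^1 \big( \dot{f}(x) \big)^2 dx
&= \int_0^{+\infty} \frac{e^{2\beta t}}{\beta^2} \big( \dot{g}(t) \big)^2 \beta e^{-\beta t}\, dt
= \int_0^{+\infty} \big( \dot{g}(t) \big)^2 \frac{e^{\beta t}}{\beta}\, dt .
\end{aligned}
```

The limits swap because $x = 1$ corresponds to $t = 0$ and $x = 0$ to $t = +\infty$; the sign change is absorbed by the minus sign in $dx$.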

Let us see now how to generalize the kernel (7.24). In Sect. 6.6.6 of the previous
chapter, we have introduced the general class of spline kernels. Here, we started our
discussion using the first-order (linear) spline kernel $\min(x, y)$ but we have seen that
higher-order models can be useful to reconstruct smoother functions, an important
example being the second-order (cubic) spline kernel (6.48). Applying exponential
time transformations to the splines, the class of the so-called stable spline kernels is
obtained. For instance, from (6.48), one obtains the second-order stable spline kernel

$$ \frac{e^{-\beta(s + t + \max(s, t))}}{2} - \frac{e^{-3\beta \max(s, t)}}{6}. \tag{7.30} $$

The bottom panels of Fig. 7.1 plot (7.30) and also some kernel sections: they
decay exponentially to zero and are more regular than those associated with (7.24).
7.1.3 More General Use of the Representer Theorem for
Linear System Identification ⋆


Theorems 7.1 and 7.3 are special cases of the more general representer theorem
involving function estimation from sparse and noisy data. It was reported as
Theorem 6.16 in the previous chapter. Let us briefly recall it. Its starting point was the
optimization problem

$$ \hat{g} = \arg\min_{g \in \mathcal{H}} \sum_{i=1}^{N} \mathcal{V}_i(y_i, L_i[g]) + \gamma \|g\|_{\mathcal{H}}^2, \tag{7.31} $$

where $\mathcal{V}_i$ is a loss function, e.g., the quadratic loss adopted in this chapter, and each
functional $L_i : \mathcal{H} \to \mathbb{R}$ is linear and bounded. Then, all the solutions of (7.31) are
given by

$$ \hat{g} = \sum_{i=1}^{N} \hat{c}_i \eta_i, \tag{7.32} $$

where each $\eta_i \in \mathcal{H}$ is the representer of $L_i$ given by

$$ \eta_i(t) = L_i[K(\cdot, t)]. \tag{7.33} $$

How to compute the expansion coefficients $\hat{c}_i$ will then depend on the nature of the
$\mathcal{V}_i$, as described in Sect. 6.5.

The estimator (7.31) can be exploited for linear system identification thinking
of $g$ as an impulse response, using, e.g., a stable spline kernel to define $\mathcal{H}$. The
linear functional $L_i$ is then defined by a convolution and returns the noiseless system
output at instant $t_i$. In particular, in discrete time one has

$$ L_i[g] = \sum_{k=1}^{\infty} g(k) u(t_i - k), \qquad t_i = 1, \ldots, N, \tag{7.34} $$

while in continuous time, it holds that

$$ L_i[g] = \int_0^{\infty} g(\tau) u(t_i - \tau)\, d\tau. \tag{7.35} $$


When quadratic losses are used, (7.31) becomes the regularization network
described in Sect. 6.5.1 whose expansion coefficients are available in closed form.
One has $\hat{c} = (O + \gamma I_N)^{-1} Y$ with the $(t, s)$-entry of the matrix $O$ given by
$O_{ts} = L_t[L_s[K]]$, as given by (7.14) in discrete time and by (7.22) in continuous time. The
use of losses $\mathcal{V}_i$ different from quadratic then opens the way also to the definition of
many new algorithms for impulse response estimation. For example, the use of
Vapnik's $\epsilon$-insensitive loss described in Sect. 6.5.3 leads to support vector regression
for linear system identification. Beyond promoting sparsity in the coefficients $c_i$, it
also makes the estimator robust against outliers since penalties on large residuals
grow linearly. Outliers can be tackled also by adopting the $\ell_1$ or Huber loss, see
Sect. 6.5.2. A general system identification framework that includes all the convex
piecewise linear quadratic losses and penalties is, e.g., described in [2].

Fig. 7.3 Discrete-time Laguerre functions

Interestingly, the estimator (7.31) can be conveniently adopted for linear system
identification also giving $g$ a different meaning from an impulse response. For
instance, in system identification there are important IIR models that use Laguerre
functions, see, e.g., [91, 92], whose z-transform is

$$ \frac{\sqrt{1 - \alpha^2}}{z - \alpha} \left( \frac{1 - \alpha z}{z - \alpha} \right)^{j-1}, \qquad j = 1, 2, \ldots $$

They form an orthonormal basis in $\ell_2$ and some of them are displayed in Fig. 7.3.
Another option is given by the Kautz basis functions that also allow one to include
information on the presence of system resonances [46]. Using $\phi_i$ to denote such basis
functions, the impulse response model can be written as

$$ f(t) = \sum_{i=1}^{\infty} g_i \phi_i(t). $$

A problem is how to determine the coefficients $g_i$ from data. Classical approaches
use truncated expansions $f = \sum_{i=1}^{d} g_i \phi_i$, with model order $d$ estimated using, e.g.,
Akaike's criterion, as discussed in Sect. 2.4.3, and then determine the $g_i$ by least
squares. An interesting alternative is to let $d = +\infty$ and to think that the $g_i$ define
the function $g$ such that $g(i) = g_i$. One can then estimate the coefficients through
(7.31) adopting a kernel, like TC and stable spline, that includes information on the
decay to zero of the expansion coefficients. Working in discrete time, the functionals $L_i$
entering (7.31) are in this case defined by

$$ L_i[g] = \sum_{j=1}^{\infty} g_j \sum_{k=1}^{\infty} \phi_j(k) u(t_i - k), $$

while in continuous time, one has

$$ L_i[g] = \sum_{j=1}^{\infty} g_j \int_0^{\infty} \phi_j(\tau) u(t_i - \tau)\, d\tau. $$

7.1.4 Connection with Bayesian Estimation of Gaussian 
Processes 


Similarly to what was discussed in the finite-dimensional setting in Sect. 4.9, the more
general regularization in RKHS can also be given a probabilistic interpretation in terms
of Bayesian estimation. In this paradigm, the different loss functions correspond to
alternative statistical models for the observation noise, while the kernel represents
the covariance of the unknown random signal, assumed independent of the noise. In
particular, when the loss is quadratic, all the involved distributions are Gaussian.

We now discuss the connection under the linear system identification perspective
where the "true" impulse response $g^0$ is seen as the random signal to estimate.
Consider the measurements model

$$ y(t_i) = L_i[g^0] + e(t_i), \qquad i = 1, \ldots, N, \tag{7.36} $$

where $L_i$ is a linear functional of the true impulse response $g^0$ defined by convolution
with the system input evaluated at $t_i$. One has

$$ L_i[g^0] = \sum_{k=1}^{\infty} g^0(k) u(t_i - k) $$

in discrete time and

$$ L_i[g^0] = \int_0^{\infty} g^0(\tau) u(t_i - \tau)\, d\tau $$

in continuous time. So, the impulse response estimators discussed in this chapter can
be compactly written as

$$ \hat{g} = \arg\min_{g \in \mathcal{H}} \sum_{i=1}^{N} \big( y(t_i) - L_i[g] \big)^2 + \gamma \|g\|_{\mathcal{H}}^2, \tag{7.37} $$

where the RKHS $\mathcal{H}$ contains functions $g : \mathcal{X} \to \mathbb{R}$ with $\mathcal{X} = \mathbb{N}$ in discrete time
and $\mathcal{X} = \mathbb{R}^+$ in continuous time.

The following result (whose simple proof is in Sect. 7.7.2) shows that, under
Gaussian assumptions on the impulse response and the noise, (7.37) provides the
minimum variance estimate of $g^0$ given the measurements $Y = [y(t_1), \ldots, y(t_N)]^T$.


Proposition 7.1 Let the following assumptions hold:

• the impulse response $g^0$ is a zero-mean Gaussian process on $\mathcal{X}$. Its covariance
function is defined by
$$ \mathcal{E}\big( g^0(t) g^0(s) \big) = \lambda K(t, s), $$
where $\lambda$ is a positive scalar and $K$ is a kernel;
• the $e(t_i)$ are mutually independent zero-mean Gaussian random variables with
variance $\sigma^2$. Moreover, they are independent of $g^0$.

Let $\mathcal{H}$ be the RKHS induced by $K$, set $\gamma = \sigma^2 / \lambda$ and define

$$ \hat{g} = \arg\min_{g \in \mathcal{H}} \Big( \sum_{i=1}^{N} \big( y(t_i) - L_i[g] \big)^2 + \gamma \|g\|_{\mathcal{H}}^2 \Big). $$

Then, $\hat{g}$ is the minimum variance estimator of $g^0$ given $Y$, i.e.,

$$ \mathcal{E}[g^0(t) \mid Y] = \hat{g}(t) \qquad \forall t \in \mathcal{X}. $$


Remark 7.2 The connection between regularization in RKHS and estimation of
Gaussian processes was first pointed out in [51] in the context of spline regression,
using quadratic losses, see also [41, 83, 90]. The connection also holds for a wide
class of losses $\mathcal{V}_i$ different from quadratic. For instance, in this statistical
framework, using the absolute value loss corresponds to Laplacian noise assumptions. The
statistical interpretation of an $\epsilon$-insensitive loss in terms of Gaussians with mean
and variance given by suitable random variables can be found in [79], see also [40,
67]. For all these noise models, and many others, it can be shown that the
RKHS estimate $\hat{g}$ includes all the possible finite-dimensional maximum a posteriori
estimates of $g^0$, see [3] for details.


Remark 7.3 The relation between RKHSs and Gaussian stochastic processes, or
more general Gaussian random fields, is stated by Proposition 7.1 in terms of
minimum variance estimators. In particular, since the representer theorem ensures that
such an estimator is the sum of a finite number of basis functions belonging to $\mathcal{H}$, it turns
out that $\hat{g}$ belongs to the RKHS induced by the covariance of $g^0$ with probability
one. Now, one may also wonder what happens a priori, before seeing the data. In
other words, the question is whether realizations of a zero-mean Gaussian process of
covariance $K$ fall in the RKHS induced by $K$. If the kernel $K$ is associated with an
infinite-dimensional $\mathcal{H}$, the answer is negative with probability one, as graphically
illustrated in Fig. 7.4.

Fig. 7.4 The largest space contains all the realizations of a zero-mean Gaussian process of
covariance $K$. The smallest space is the RKHS $\mathcal{H}$ induced by $K$, assumed here infinite dimensional. The
probability that realizations of $f$ fall in the RKHS is zero. Instead, when the assumptions
underlying the representer theorem hold, the realizations of the minimum variance estimator $\mathcal{E}[f \mid Y]$ are
contained in $\mathcal{H}$ with probability one

While deep discussions can be found in [9, 34, 59, 68], here
we give just a hint on this fact. Assume that the kernel admits the decomposition


$$ K(s, t) = \sum_{i=1}^{M} \zeta_i \phi_i(s) \phi_i(t) $$

inducing an $M$-dimensional RKHS $\mathcal{H}$. Let the deterministic functions $\phi_i$ be
independent. Then, we know from Theorem 6.13 that, if $f(t) = \sum_{i=1}^{M} a_i \phi_i(t)$, then

$$ \|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{M} \frac{a_i^2}{\zeta_i}. $$

Now, think of $K$ as a covariance and let the $a_i$ be zero-mean Gaussian and independent
random variables of variance $\zeta_i$, i.e.,

$$ a_i \sim \mathcal{N}(0, \zeta_i). $$

Then, the so-called Karhunen-Loève expansion of the Gaussian random field
$f \sim \mathcal{N}(0, K)$, also discussed in Sect. 5.6 to connect regularization and basis expansion
in finite dimension, is given by

$$ f(t) = \sum_{i=1}^{M} a_i \phi_i(t) $$

with $M$ possibly infinite and convergence in quadratic mean. The RKHS norm of $f$
is now a random variable and, since the $a_i$ are mutually independent with $\mathcal{E} a_i^2 = \zeta_i$,
one has

$$ \mathcal{E} \|f\|_{\mathcal{H}}^2 = \mathcal{E} \sum_{i=1}^{M} \frac{a_i^2}{\zeta_i} = \sum_{i=1}^{M} \frac{\mathcal{E} a_i^2}{\zeta_i} = M. $$

So, if the RKHS is infinite dimensional, one has $M = \infty$ and the expected (squared)
RKHS norm of the process $f$ diverges to infinity.
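
A small Monte Carlo experiment illustrates why the expected squared norm grows with the number of expansion terms. This is a sketch under stated assumptions: the function name `mean_squared_rkhs_norm` is hypothetical, and the eigenvalues are taken, purely as an example, from the first-order spline expansion $\zeta_i = 1/(i\pi - \pi/2)^2$.

```python
import numpy as np

def mean_squared_rkhs_norm(zeta, n_samples, seed=0):
    """Sample a_i ~ N(0, zeta_i) and average sum_i a_i**2 / zeta_i over the samples."""
    rng = np.random.default_rng(seed)
    zeta = np.asarray(zeta, dtype=float)
    # Each row is one realization of the coefficient vector (a_1, ..., a_M).
    a = rng.standard_normal((n_samples, zeta.size)) * np.sqrt(zeta)
    return float(np.mean(np.sum(a**2 / zeta, axis=1)))

# The sample average tracks the number of terms M, whatever the eigenvalues are.
for M in (5, 50, 500):
    i = np.arange(1, M + 1)
    zeta_demo = 1.0 / (i * np.pi - np.pi / 2.0) ** 2
    print(M, mean_squared_rkhs_norm(zeta_demo, n_samples=2000))
```

The averages stay close to $M$ regardless of how fast the $\zeta_i$ decay, which is the finite-$M$ version of the divergence described above.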


7.1.5 A Numerical Example 


Our goal now is to illustrate the influence of the choice of the kernel on the quality of
the impulse response estimate using also the Bayesian interpretation of regularization.
The example is a simple linear discrete-time system in the form of (7.1). Using
the z-transform to express its transfer function, the model is

$$ y(t) = \frac{1}{z(z - 0.85)}\, u(t) + e(t), \qquad t = 1, \ldots, 20. \tag{7.38} $$

The system's impulse response is reported in Fig. 7.5. The disturbances $e(t)$ are
independent and Gaussian random variables with mean zero and variance $0.05^2$. For
ease of visualization, we let the input $u(t)$ be an impulsive signal, i.e., $u(0) = 1$ and
$u(t) = 0$ elsewhere. Thus, the impulse response has to be estimated from 20 direct
and noisy impulse response measurements.

We consider a Monte Carlo simulation of 200 runs. At any run, the outputs are
obtained by generating mutually independent measurement noises. One data set is
shown in Fig. 7.5. For each of the 200 data sets, we use the regularized IIR estimator
(7.10). As for $K : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$, we will compare the performance of
three kernels: the Gaussian (6.43), the cubic spline (6.48) and the stable spline (7.15)
defined, respectively, by

$$ \exp\Big( -\frac{(i - j)^2}{\rho} \Big), \qquad \frac{ij \min\{i, j\}}{2} - \frac{(\min\{i, j\})^3}{6}, \qquad \alpha^{\max(i, j)}. $$

Recall that the Gaussian and the cubic spline kernel are the most used in machine
learning to include information on smoothness. The cubic spline estimator could be
also complemented with a bias space given, e.g., by a linear function, as described
in Sect. 6.6.7. However, one would obtain results very similar to those described in
what follows.

To adopt the estimator (7.10), we need to find a suitable value for the regularization
parameter $\gamma$ and also for the unknown kernel parameters, i.e., the kernel width $\rho$ in the
Gaussian kernel and the stability parameter $\alpha$ for stable spline. As already done, e.g.,
in Sect. 1.2 for ridge regression, an oracle-based procedure is adopted to optimally
balance bias and variance. The unknown parameters are obtained by maximizing the
measure of fit defined as follows:
Fig. 7.5 The true impulse response (thick line) and one out of the 200 data sets (◦); horizontal axis: time $t$

$$ 100 \left( 1 - \left[ \frac{\sum_{k=1}^{50} |g_k^0 - \hat{g}_k|^2}{\sum_{k=1}^{50} |g_k^0 - \bar{g}^0|^2} \right]^{1/2} \right), \qquad \bar{g}^0 = \frac{1}{50} \sum_{k=1}^{50} g_k^0, \tag{7.39} $$

where computation is restricted only to the first 50 samples where, in practice, the 
impulse response is different from zero. This tuning procedure is ideal since it exploits 
the true function g°. It is useful here since it excludes the uncertainty brought by the 
kernel tuning procedure and will fully reveal the influence of the kernel choice on 
the quality of the impulse response estimate. 
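
The fit measure (7.39) is straightforward to compute once the true impulse response is available. The sketch below is one way to do it; the function names are hypothetical, and the impulse response coefficients are obtained here by expanding the transfer function in (7.38) as $z^{-2}/(1 - 0.85 z^{-1})$, which gives $g^0_1 = 0$ and $g^0_k = 0.85^{k-2}$ for $k \geq 2$.

```python
import numpy as np

def true_impulse_response(n):
    """First n coefficients g_1..g_n of 1/(z(z - 0.85)) = z**-2 / (1 - 0.85 z**-1)."""
    k = np.arange(1, n + 1)
    return np.where(k >= 2, 0.85 ** (k - 2.0), 0.0)

def fit_measure(g_true, g_hat):
    """Fit (7.39): 100 * (1 - sqrt(sum |g0 - g_hat|^2 / sum |g0 - mean(g0)|^2))."""
    g_true = np.asarray(g_true, dtype=float)
    g_hat = np.asarray(g_hat, dtype=float)
    num = np.sum((g_true - g_hat) ** 2)
    den = np.sum((g_true - g_true.mean()) ** 2)
    return float(100.0 * (1.0 - np.sqrt(num / den)))

# Usage: a perfect estimate scores 100, the constant mean scores 0.
g0 = true_impulse_response(50)
print(fit_measure(g0, g0))
print(fit_measure(g0, np.full(50, g0.mean())))
```

An oracle in the sense described above would evaluate `fit_measure` over a grid of hyperparameter values and keep the maximizer, which is only possible in simulation because it requires $g^0$.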

The impulse response estimates obtained by the cubic spline, the Gaussian and
the stable spline kernel are reported in Fig. 7.6. When the cubic spline kernel (6.48) is
chosen, the impulse response estimates diverge as time progresses. This result can be also
given a Bayesian interpretation where (6.48) becomes the covariance of the stochastic
process $g^0$. Specifically, the cubic spline kernel models the impulse response as the
double integration of white noise. So, impulse response coefficients are correlated
but the prior variance increases in time. For stable systems, variability is instead
expected to decay to zero as $t$ progresses. When the Gaussian kernel (6.43) is chosen, the
quality of the impulse response estimates improves considerably, but many of them exhibit
oscillations and the variance of the impulse response estimator is still large. Bayesian
arguments here show that the Gaussian kernel models $g^0$ as a stationary stochastic
process. Smoothness information is encoded but not the fact that one expects
the prior variance to decay to zero. Finally, the impulse response estimates returned
by the stable spline kernel (7.15) are all very close to the truth. These outcomes
are similar to those described, e.g., in Example 5.4 in Sect. 5.5. In particular, even
if this example is rather simple, it shows clearly that a straightforward application
of standard kernels from machine learning and smoothing splines literature may
give unsatisfactory results. Inclusion of dynamic systems features in the regularizer,
like smooth exponential decay, greatly enhances the quality of the impulse response
estimates.

Fig. 7.6 True impulse response (thick line) and 200 impulse response estimates obtained using the cubic
spline kernel (6.48) (top panel), the Gaussian kernel (6.43) (middle) and the stable spline kernel (bottom).
The unknown parameters are estimated by an oracle that maximizes the fit (7.39) for each data set


7.2 Kernel Tuning 


As we have seen in the previous parts of the book, the kernels depend on some
unknown parameters, the so-called hyperparameters. They can, e.g., include scale
factors, the kernel width of the Gaussian kernel or the impulse response's decay
rate in the TC and stable spline kernels. In real-world applications, the oracle-based
procedure used in the previous section cannot be used. The kernels need instead
to be tuned from data. This procedure is referred to as hyperparameter estimation
and is the counterpart of model order selection in the classical paradigm of system
identification. It determines model complexity within the new paradigm where
system identification is seen as regularized function estimation in RKHSs. This
calibration step will thus have a major impact on the model's performance, e.g., in terms of
predictive capability on new data. Due to the connection with the ReLS methods in
quadratic form, the tuning methods introduced in Chaps. 3 and 4 can be easily applied
also in the RKHS framework. In particular, let $K(\eta)$ denote a kernel, where $\eta$ is the
hyperparameter vector belonging to the set $\Gamma$. Such a vector could also include other
parameters not present in the kernel, e.g., the noise variance $\sigma^2$. Some calibration
methods to estimate $\eta$ from data are then reported below.


7.2.1 Marginal Likelihood Maximization 


The first approach we describe is marginal likelihood maximization (MLM), also
called the empirical Bayes method in Sect. 4.4. MLM relies on the Bayesian
interpretation of function estimation in RKHS discussed in Sect. 7.1.4. Under the same
assumptions stated in Proposition 7.1, $\eta$ can be estimated by maximum likelihood

$$ \hat{\eta} = \arg\max_{\eta \in \Gamma} p(Y \mid \eta), \tag{7.40} $$

with $p(Y \mid \eta)$ obtained by integrating out $g^0$ from the joint density $p(Y \mid g^0)\, p(g^0 \mid \eta)$,
i.e.,

$$ p(Y \mid \eta) = \int p(Y \mid g^0)\, p(g^0 \mid \eta)\, dg^0. \tag{7.41} $$

The probability density $p(Y \mid \eta)$ is the marginal likelihood and, hence, (7.40) is called
the MLM method.
Computation of (7.41) is especially simple in our case since our measurements
model is linear and Gaussian. In fact, in the Bayesian interpretation of regularized
linear system identification in RKHS, the impulse response $g^0$ is a zero-mean Gaussian
process with covariance $\lambda K$, where $\lambda$ is a positive scale factor. The impulse
response is also assumed independent of the noises $e(t)$ which are white and Gaussian
of variance $\sigma^2$. Recall also the definition of the matrix $O$, now possibly a function
of $\eta$, reported in (7.14) for the discrete-time case, i.e., when $\mathcal{X} = \mathbb{N}$, and in (7.22) for
the continuous-time case, i.e., when $\mathcal{X} = \mathbb{R}^+$. The matrix $\lambda O(\eta)$ plays an important
role in the MLM method since it corresponds to the covariance matrix of the
noise-free output vector $[L_1[g^0], \ldots, L_N[g^0]]^T$ and is thus often called the output
kernel matrix. Then, as also discussed in Sect. 7.7.2, it follows that the vector $Y$ is
Gaussian with zero mean, i.e.,

$$ Y \sim \mathcal{N}(0, Z(\eta)), $$

where the covariance matrix $Z(\eta)$ is given by

$$ Z(\eta) = \lambda O(\eta) + \sigma^2 I_N $$

with $I_N$ the $N \times N$ identity matrix. Here, the vector $\eta$ could, e.g., contain both $\lambda$ and
$\sigma^2$. One then obtains that the empirical Bayes estimate of $\eta$ in (7.40) becomes

$$ \hat{\eta} = \arg\min_{\eta \in \Gamma}\; Y^T Z(\eta)^{-1} Y + \log \det(Z(\eta)), \tag{7.42} $$

where the objective is proportional to the negative logarithm of the marginal likelihood,
up to an additive constant.

As discussed in Chap. 4, the MLM method includes the Occam's razor principle,
i.e., unnecessarily complex models are automatically penalized, see e.g., [83]. In
particular, the Occam's factor arises thanks to the marginalization and it manifests
itself in the term $\log\det(Z(\eta))$ in (7.42). A simple example can be obtained by thinking
of the behaviour of the objective for different values of the kernel scale factor $\lambda$. When
$\lambda$ increases, the model becomes more complex since, from a stochastic viewpoint,
the prior variance of the impulse response $g^0$ increases. In fact, the term $Y^T Z(\eta)^{-1} Y$,
related to the data fit, decreases since the inverse of $Z(\eta)$ tends to the null matrix
(the model has infinite variance and can describe any kind of data). But the Occam's
factor increases since $\det(Z(\eta))$ grows to infinity. In this way, $\hat{\eta}$ will balance data fit
and model complexity.
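The trade-off just described can be checked numerically. The following is a minimal sketch, not the implementation used for the book's experiments: it assumes a FIR model of length `n`, a first-order stable spline kernel with fixed decay, a known noise variance, and synthetic data. The names `tc_kernel`, `mlm_objective`, `Phi` and all numerical values are illustrative assumptions. It evaluates the two terms of (7.42) on a grid of values of $\lambda$.

```python
import numpy as np

def tc_kernel(n, alpha):
    """First-order stable spline kernel: K[i, j] = alpha**max(i, j), i, j = 1..n."""
    idx = np.arange(1, n + 1)
    return alpha ** np.maximum.outer(idx, idx)

def mlm_objective(Y, O, lam, sigma2):
    """Objective of (7.42) and its two terms, with Z = lam * O + sigma2 * I.

    The data-fit term is Y' Z^{-1} Y and the Occam term is log det Z.
    A Cholesky factor Z = L L' gives both in a numerically stable way.
    """
    Z = lam * O + sigma2 * np.eye(len(Y))
    L = np.linalg.cholesky(Z)
    w = np.linalg.solve(L, Y)                       # w'w = Y' Z^{-1} Y
    fit = float(w @ w)
    occam = float(2.0 * np.sum(np.log(np.diag(L))))  # log det Z
    return fit + occam, fit, occam

# Synthetic data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
N, n, sigma2 = 200, 50, 0.01
u = rng.standard_normal(N + n)
# Regression matrix: Phi[i, k-1] = u(t_i - k), k = 1..n
Phi = np.column_stack([u[n - k : n - k + N] for k in range(1, n + 1)])
g0 = 0.8 ** np.arange(1, n + 1)                     # a stable "true" impulse response
Y = Phi @ g0 + np.sqrt(sigma2) * rng.standard_normal(N)

O = Phi @ tc_kernel(n, 0.8) @ Phi.T                 # output kernel matrix
lams = np.logspace(-3, 3, 13)
results = [mlm_objective(Y, O, lam, sigma2) for lam in lams]
objs, fits, occams = map(np.array, zip(*results))
lam_hat = lams[np.argmin(objs)]
print("lambda minimizing (7.42) on the grid:", lam_hat)
```

On such a grid the data-fit term decreases and the Occam term increases as $\lambda$ grows, so the minimizer is an interior compromise. In practice one would optimize jointly over all hyperparameters with a numerical optimizer rather than a one-dimensional grid.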


7.2.1.1 Numerical Example 


To illustrate the effectiveness of MLM, we revisit the example reported in Sect. 1.2.
The problem is to reconstruct the impulse response reported in Fig. 7.7 (red line)
from the 1000 input-output data displayed in Fig. 1.2. The system input is low-pass and
this makes estimation hard due to ill-conditioning.

Fig. 7.7 True impulse response (red thick line) and impulse response estimates obtained by ridge regression with hyperparameters estimated by an oracle that optimizes the fit (top panel), and by the stable spline kernels of order 1 (middle) and 2 (bottom) with hyperparameters estimated by marginal likelihood maximization

We will adopt three kernels. Using $\delta$ to denote the Kronecker delta, the value
$K(i, j)$ is defined, respectively, by

$$\delta_{ij}, \qquad \alpha^{\max(i,j)}, \qquad \frac{\alpha^{i+j+\max(i,j)}}{2} - \frac{\alpha^{3\max(i,j)}}{6}.$$

The first choice corresponds to ridge regression with the regularizer given by the sum
of squared impulse response coefficients. The other two are the first- and second-order
stable spline kernels reported in (7.15) and in (7.30), respectively. More specifically,
the last kernel corresponds to the discrete-time version of (7.30) with $\alpha = e^{-\beta}$.
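As a minimal sketch, the three kernels can be built as finite matrices by restricting the indices to $i, j = 1, \ldots, n$. The function names, the truncation length `n` and the value of `alpha` are illustrative assumptions. The diagonal of each matrix is the prior variance that the Bayesian interpretation assigns to each impulse response coefficient: constant for ridge regression, decaying for the two stable spline kernels.

```python
import numpy as np

def ridge_kernel(n):
    """Identity kernel K[i, j] = delta_ij (ridge regression)."""
    return np.eye(n)

def ss1_kernel(n, alpha):
    """First-order stable spline kernel: alpha**max(i, j)."""
    idx = np.arange(1, n + 1)
    return alpha ** np.maximum.outer(idx, idx)

def ss2_kernel(n, alpha):
    """Second-order stable spline kernel:
    alpha**(i + j + max(i, j)) / 2 - alpha**(3 * max(i, j)) / 6."""
    idx = np.arange(1, n + 1)
    m = np.maximum.outer(idx, idx)
    s = np.add.outer(idx, idx)
    return alpha ** (s + m) / 2.0 - alpha ** (3 * m) / 6.0

n, alpha = 100, 0.9          # illustrative values
kernels = {
    "ridge": ridge_kernel(n),
    "stable spline 1": ss1_kernel(n, alpha),
    "stable spline 2": ss2_kernel(n, alpha),
}
for name, K in kernels.items():
    # Prior variance of the first and of the 50th coefficient, and smallest
    # eigenvalue (nonnegative up to rounding for a valid kernel).
    print(name, K[0, 0], K[49, 49], np.linalg.eigvalsh(K).min())
```

The constant diagonal of the identity kernel is the white-noise prior mentioned in connection with ridge regression, while the stable spline diagonals decay as $\alpha^{i}$ and $\alpha^{3i}/3$.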

In Fig. 1.5, we reported the ridge regularized estimate with $\gamma$ chosen by an oracle
to maximize the fit. To ease comparison with other approaches, that figure is also
reproduced in the top panel of Fig. 7.7. The reconstruction is not satisfactory since
the regularizer does not include information on smoothness and decay. In fact, the
Bayesian interpretation reveals that ridge regression describes the impulse response
as a realization of white noise, a poor model for stable dynamic systems. This also
explains the presence of oscillations in the reconstructed profile.

The middle and bottom panels report the estimates obtained by the stable spline
kernels with the noise variance and the hyperparameters $\gamma$, $\alpha$ tuned from data through
MLM. Even if no oracle is used, the quality of the impulse response reconstruction
greatly increases. This is also confirmed by a Monte Carlo study where 200 data
sets are obtained using the same kind of input but generating new independent noise
realizations. MATLAB boxplots of the 200 fits for all three estimators are in
Fig. 7.8. Here, the median is given by the central mark while the box edges are the
25th and 75th percentiles. The whiskers extend to the most extreme fits not
seen as outliers. Finally, the outliers are plotted individually. Average fits are 73.7%
for ridge, 83.9% for first-order and 90.2% for second-order stable spline.

In this example, one can see that it is preferable to use the second-order stable
spline kernel. This is easily explained by the fact that the true impulse response is
quite regular so that increasing our expected smoothness improves the performance.


Fig. 7.8 Boxplot of the fits achieved by ridge regression

Interestingly, the selection between different kernels, like first- and second-order
stable spline, can also be performed automatically by MLM, thus addressing the problem
of model comparison described in Sect. 2.6.2. In fact, let $s$ denote an additional
hyperparameter that may assume only the values 0 or 1. Then, we can consider the
combined kernel

$$s\,\alpha^{\max(i,j)} + (1-s)\left(\frac{\alpha^{i+j+\max(i,j)}}{2} - \frac{\alpha^{3\max(i,j)}}{6}\right)$$

and optimize the hyperparameters $s$, $\alpha$ and $\gamma$ by MLM. Clearly, the role of $s$ is to
select one of the two kernels, e.g., if the estimate $\hat{s}$ is 0, then the impulse response
estimate will be given by a second-order stable spline. Applying this procedure to our
problem, one finds that the second-order stable spline kernel is selected 177 times
out of the 200 Monte Carlo runs. The obtained fits are shown in Fig. 7.9; their mean is
88.8%.


Remark 7.4 Kernel choice via MLM has also connections with selection through
the concept of Bayesian model probability discussed in Sect. 4.11, see also [50]. In
fact, assume we are given different competitive kernels (covariances) $K^i$ and, for the
moment, assume also that all the hyperparameter vectors $\eta^i$ are known. We can then
interpret each kernel as a different model. We can also assign a priori probabilities that
data have been generated by the $i$th covariance $K^i$, hence thinking of any model as a
random variable itself. If all the kernels are given the same probability, the marginal
likelihood computed using $K^i$ becomes proportional to the posterior probability of
the $i$th model. This makes it possible to exploit the marginal likelihood to select the "best"
kernel-based estimate among those generated by the $K^i$. When hyperparameters are
unknown, the marginal likelihoods can be evaluated with each $\eta^i$ set to its estimate
$\hat{\eta}^i$. In this case, care is needed since maximized likelihoods define model posterior
probabilities that do not account for hyperparameter uncertainty. For example, if
the dimension of $\eta^i$ changes with $i$, the risk is to select a kernel that has many
parameters and overfits. This problem can be mitigated, e.g., by adopting the criteria
described in Sect. 2.4.3. For instance, using BIC, we compute

$$\hat{i} = \arg\min_i \; -2\log p(Y|\hat{\eta}^i) + (\dim \eta^i)\log N,$$

where $N$ is the number of available output measurements and $\dim \eta^i$ is the number
of hyperparameters contained in the $i$th model. Note that, when using stable
spline kernels as in the above example, the BIC penalty is irrelevant since the first-
and the second-order stable spline estimators contain the same number of unknown
hyperparameters.


7.2.2 Stein’s Unbiased Risk Estimator 


The second method is Stein's unbiased risk estimator (SURE), introduced
in Sect. 3.5.3.2. The idea of SURE is to minimize an unbiased estimator of the risk,
which is the expected in-sample validation error of the model estimate. In what
follows, $g^0$ is no longer stochastic as in the previous subsection but corresponds to a
deterministic impulse response. Identification data are given by

$$y(t_i) = L_i[g^0] + e(t_i), \quad i = 1, \ldots, N,$$

where the $e(t_i)$ are independent, with zero mean and known variance $\sigma^2$, and each
$L_i$ is the linear functional defined by convolution with the system input evaluated
at $t_i$. One thus has $L_i[g^0] = \sum_{k=1}^{\infty} g^0(k)\, u(t_i - k)$ in discrete time, where the $t_i$
assume integer values, and $L_i[g^0] = \int_0^{\infty} g^0(\tau)\, u(t_i - \tau)\, d\tau$ in continuous time. The
$N$ independent validation outputs $y_v(t_i)$ are then defined by using the same
input that generates the identification data but an independent copy of the noises,
i.e.,

$$y_v(t_i) = L_i[g^0] + e_v(t_i), \quad i = 1, \ldots, N. \tag{7.43}$$

So, all the $2N$ random variables $e_v(t_i)$ and $e(t_i)$ are mutually independent, with zero
mean and variance $\sigma^2$. Consider the impulse response estimator


$$\hat{g} = \arg\min_{g \in \mathcal{H}} \; \sum_{i=1}^{N} \left(y(t_i) - L_i[g]\right)^2 + \gamma \|g\|_{\mathcal{H}}^2$$

as a function of the hyperparameter vector $\eta$. The predictions of the $y_v(t_i)$ are then
given by $L_i[\hat{g}]$ and also depend on $\eta$. The expected in-sample validation error of the
model estimate $\hat{g}$ is then given by the mean prediction error

$$\mathrm{EVE}_{in}(\eta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{E}\left(y_v(t_i) - L_i[\hat{g}]\right)^2, \tag{7.44}$$

where the expectation $\mathcal{E}$ is over the random noises $e_v(t_i)$ and $e(t_i)$. Note that the result
depends not only on $\eta$ but also on the unknown (deterministic) impulse response $g^0$.
So, we cannot compute the prediction error. However, it is possible to derive an
unbiased estimate of it. To obtain this, let $\hat{Y}(\eta)$ be the (column) vector with components
$L_i[\hat{g}]$. The output kernel matrix $O(\eta)$, already introduced to describe marginal
likelihood maximization, then gives the connection between the vector $Y$ containing the
measured outputs $y(t_i)$ and the predictions. In fact, using the representer theorem to
obtain $\hat{g}$, and hence the $L_i[\hat{g}]$, one obtains

$$\hat{Y}(\eta) = O(\eta)\left(O(\eta) + \gamma I_N\right)^{-1} Y. \tag{7.45}$$

Following the same line of discussion developed in Sect. 3.5.3.2 to obtain (3.96), we
can derive the following unbiased estimator of (7.44):

$$\widehat{\mathrm{EVE}}_{in}(\eta) = \frac{1}{N}\|Y - \hat{Y}(\eta)\|^2 + 2\sigma^2 \frac{\mathrm{dof}(\eta)}{N}, \tag{7.46}$$

where $\mathrm{dof}(\eta)$ are the degrees of freedom of $\hat{Y}(\eta)$ given by

$$\mathrm{dof}(\eta) = \mathrm{trace}\left(O(\eta)\left(O(\eta) + \gamma I_N\right)^{-1}\right) \tag{7.47}$$

that vary from $N$ to 0 as $\gamma$ increases from 0 to $\infty$.

Note that (7.46) is a function only of the $N$ output measurements $y(t_i)$. Thus, we can
estimate the hyperparameter vector $\eta$ by minimizing the unbiased estimator $\widehat{\mathrm{EVE}}_{in}(\eta)$
of $\mathrm{EVE}_{in}(\eta)$ to achieve

$$\hat{\eta} = \arg\min_{\eta \in \Gamma} \; \frac{1}{N}\|Y - \hat{Y}(\eta)\|^2 + 2\sigma^2 \frac{\mathrm{dof}(\eta)}{N}. \tag{7.48}$$

The above formula has the same form as the AIC criterion (2.33) computed assuming
Gaussian noise of known variance $\sigma^2$, except that the dimension $m$ of the model
parameter $\theta$ is now replaced by the degrees of freedom $\mathrm{dof}(\eta)$.
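A minimal sketch of how (7.45)-(7.48) can be evaluated follows. It assumes a FIR model of length `n`, a first-order stable spline kernel with fixed decay, a known noise variance and synthetic data, so that only $\gamma$ is tuned over a grid. The function name `sure` and all numerical values are illustrative assumptions.

```python
import numpy as np

def sure(Y, O, gamma, sigma2):
    """SURE score (7.46) and degrees of freedom (7.47) for a given gamma."""
    N = len(Y)
    # H = O (O + gamma I)^{-1}; since O is symmetric the two factors commute,
    # so solving (O + gamma I) H = O gives the same matrix.
    H = np.linalg.solve(O + gamma * np.eye(N), O)
    Y_hat = H @ Y                                   # predictions, as in (7.45)
    dof = float(np.trace(H))
    score = float(np.sum((Y - Y_hat) ** 2) / N + 2.0 * sigma2 * dof / N)
    return score, dof

# Synthetic data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
N, n, sigma2 = 200, 50, 0.01
u = rng.standard_normal(N + n)
Phi = np.column_stack([u[n - k : n - k + N] for k in range(1, n + 1)])
idx = np.arange(1, n + 1)
K = 0.8 ** np.maximum.outer(idx, idx)               # first-order stable spline
Y = Phi @ (0.8 ** idx) + np.sqrt(sigma2) * rng.standard_normal(N)
O = Phi @ K @ Phi.T                                 # output kernel matrix

gammas = np.logspace(-4, 2, 13)
scores, dofs = map(np.array, zip(*[sure(Y, O, g, sigma2) for g in gammas]))
gamma_hat = gammas[np.argmin(scores)]
print("gamma minimizing (7.48) on the grid:", gamma_hat)
```

In this sketch $O$ has rank at most `n` because the model is a FIR of length `n` smaller than $N$, so the degrees of freedom saturate at `n` rather than $N$ as $\gamma$ tends to zero. They still decrease monotonically towards zero as $\gamma$ grows.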


7.2.3 Generalized Cross-Validation 


The third approach is the generalized cross-validation (GCV) method. As discussed
in Sects. 2.6.3 and 3.5.2.3, cross-validation (CV) is a classical way to estimate the
expected validation error by efficient reuse of the data, and GCV is closely related
to the $N$-fold CV with quadratic losses. To describe it in the RKHS framework,
let $\hat{g}^k$ be the solution of the following function estimation problem:

$$\hat{g}^k = \arg\min_{g \in \mathcal{H}} \; \sum_{i=1,\, i \neq k}^{N} \left(y(t_i) - L_i[g]\right)^2 + \gamma \|g\|_{\mathcal{H}}^2. \tag{7.49}$$

So, $\hat{g}^k$ is the function estimate when the $k$th datum $y(t_k)$ is left out. As also described,
e.g., in [90, Chap. 4], the following relation between the prediction error of $\hat{g}$ and
the prediction error of $\hat{g}^k$ holds:

$$y(t_k) - L_k[\hat{g}^k] = \frac{y(t_k) - L_k[\hat{g}]}{1 - H_{kk}(\eta)}, \tag{7.50}$$

where $H_{kk}(\eta)$ is the $(k, k)$th element of the influence matrix

$$H(\eta) = O(\eta)\left(O(\eta) + \gamma I_N\right)^{-1}.$$

Therefore, the validation error of the $N$-fold CV with quadratic loss function is


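$$\sum_{k=1}^{N} \left(y(t_k) - L_k[\hat{g}^k]\right)^2 = \sum_{k=1}^{N} \left(\frac{y(t_k) - L_k[\hat{g}]}{1 - H_{kk}(\eta)}\right)^2. \tag{7.51}$$

Minimizing the above expression as a criterion to estimate the hyperparameter vector $\eta$ leads
to the predicted residual sums of squares (PRESS) method

$$\hat{\eta} = \arg\min_{\eta \in \Gamma} \; \sum_{k=1}^{N} \left(\frac{y(t_k) - L_k[\hat{g}]}{1 - H_{kk}(\eta)}\right)^2. \tag{7.52}$$

The above criterion coincides with that derived in (3.80) working in the
finite-dimensional setting.

GCV is a variant of (7.52) obtained by replacing each $H_{kk}(\eta)$, $k = 1, \ldots, N$, in
(7.52) with their average. One obtains

$$\hat{\eta} = \arg\min_{\eta \in \Gamma} \; \sum_{k=1}^{N} \left(\frac{y(t_k) - L_k[\hat{g}]}{1 - \mathrm{trace}(H(\eta))/N}\right)^2. \tag{7.53}$$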

In view of (7.45), one has

$$\hat{Y}(\eta) = H(\eta) Y$$

and, from (7.47), one can see that $\mathrm{trace}(H(\eta))$ corresponds to the degrees of freedom
$\mathrm{dof}(\eta)$, i.e.,

$$\mathrm{trace}(H(\eta)) = \mathrm{dof}(\eta).$$

So, the GCV criterion (7.53) can be rewritten as follows:

$$\hat{\eta} = \arg\min_{\eta \in \Gamma} \; \frac{\|Y - \hat{Y}(\eta)\|^2}{\left(1 - \mathrm{dof}(\eta)/N\right)^2}. \tag{7.54}$$

This corresponds to the criterion (3.82) obtained in the finite-dimensional setting.

Unlike SURE, PRESS and GCV have the practical advantage that they do
not require knowledge (or preliminary estimation) of the noise variance $\sigma^2$.
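The leave-one-out identity (7.50) can be verified numerically. The sketch below assumes a FIR model with a first-order stable spline kernel and synthetic data, with all names and values chosen for illustration. It refits the estimator $N$ times by brute force, compares the result with the shortcut based on the influence matrix, and then evaluates the PRESS and GCV scores for one fixed $\gamma$.

```python
import numpy as np

# Synthetic data (all values are illustrative assumptions).
rng = np.random.default_rng(1)
N, n, gamma = 60, 20, 0.5
u = rng.standard_normal(N + n)
Phi = np.column_stack([u[n - k : n - k + N] for k in range(1, n + 1)])
idx = np.arange(1, n + 1)
K = 0.8 ** np.maximum.outer(idx, idx)               # first-order stable spline
Y = Phi @ (0.7 ** idx) + 0.1 * rng.standard_normal(N)

def predict(Phi_tr, Y_tr, Phi_te):
    """Regularized estimate from training data, then predictions on Phi_te.

    In this finite-dimensional setting the estimate is
    g_hat = K Phi' (Phi K Phi' + gamma I)^{-1} Y.
    """
    O_tr = Phi_tr @ K @ Phi_tr.T
    c = np.linalg.solve(O_tr + gamma * np.eye(len(Y_tr)), Y_tr)
    return Phi_te @ (K @ Phi_tr.T @ c)

# Shortcut: residuals of the full fit divided by 1 - H_kk, as in (7.50).
O = Phi @ K @ Phi.T
H = np.linalg.solve(O + gamma * np.eye(N), O)       # influence matrix
res = Y - H @ Y
loo_shortcut = res / (1.0 - np.diag(H))

# Brute force: refit N times, each time leaving the k-th datum out.
loo_brute = np.empty(N)
for k in range(N):
    keep = np.arange(N) != k
    loo_brute[k] = Y[k] - predict(Phi[keep], Y[keep], Phi[k:k + 1])[0]

press = float(np.sum(loo_shortcut ** 2))                          # objective in (7.52)
gcv = float(np.sum(res ** 2) / (1.0 - np.trace(H) / N) ** 2)      # objective in (7.54)
print("max discrepancy:", np.max(np.abs(loo_brute - loo_shortcut)))
print("PRESS:", press, "GCV:", gcv)
```

The shortcut needs a single fit instead of $N$, which is what makes PRESS and GCV cheap enough to be used inside a hyperparameter search.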


7.3 Theory of Stable Reproducing Kernel Hilbert Spaces 


In the numerical experiments reported in this chapter, we have seen that regularized
IIR models based, e.g., on TC and stable splines provide much better estimates of
stable linear dynamic systems than other popular machine learning choices like the
Gaussian kernel. The key was the inclusion in the identification process of
information on the decay rate of the impulse response. This motivates the study of
the class of the so-called stable kernels, which enforce the stability constraint on the
induced RKHS.


7.3.1 Kernel Stability: Necessary and Sufficient Conditions 


The necessary and sufficient condition for a linear system to be bounded-input
bounded-output (BIBO) stable is that its impulse response $g \in \ell_1$ for the
discrete-time case and $g \in \mathcal{L}_1$ for the continuous-time case. Here, $\ell_1$ is the space of absolutely
summable sequences, while $\mathcal{L}_1$ contains the absolutely integrable functions on $\mathbb{R}^+$
(equipped with the classical Lebesgue measure), i.e.,

$$\sum_{k=1}^{\infty} |g_k| < \infty \;\; \forall g \in \ell_1 \quad \text{and} \quad \int_0^{\infty} |g(x)|\,dx < \infty \;\; \forall g \in \mathcal{L}_1. \tag{7.55}$$

Therefore, for regularized identification of stable systems the impulse response
should be searched within a RKHS that is a subspace of $\ell_1$ in discrete time and
a subspace of $\mathcal{L}_1$ in continuous time. This naturally leads to the following definition
of stable kernels.


Definition 7.1 (Stable kernel, based on [32, 73]) Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive
semidefinite kernel and $\mathcal{H}$ be the RKHS of functions $\mathcal{X} \to \mathbb{R}$ induced by $K$. Then, $K$ is
said to be stable if

• $\mathcal{H} \subset \ell_1$ for the discrete-time case where $\mathcal{X} = \mathbb{N}$;
• $\mathcal{H} \subset \mathcal{L}_1$ for the continuous-time case where $\mathcal{X} = \mathbb{R}^+$.

If a kernel $K$ is not stable, it is said to be unstable. Accordingly, the RKHS $\mathcal{H}$
is said to be stable or unstable if $K$ is stable or unstable.


Given a kernel, the question is now how to assess its stability. For this purpose,
a direct use of the above definition is often challenging since it can be difficult to
understand which functions belong to the associated RKHS. Stability conditions
stated directly on $K$ would instead be desirable. A first observation is that, since $\mathcal{H}$
contains all kernel sections according to Theorem 6.2, all of them must be stable. In
discrete time, this means $K(i, \cdot) \in \ell_1$ for all $i$. However, this condition is necessary
but not sufficient for stability, a fact which is not so surprising since we have seen in
Sect. 6.2 that $\mathcal{H}$ contains also all the Cauchy limits of linear combinations of kernel
sections. For instance, in Example 6.4, we have seen that the identity kernel $K(i, j) = \delta_{ij}$,
connected with ridge regression but here defined over all $\mathbb{N} \times \mathbb{N}$, induces $\ell_2$. Such a
space is not contained in $\ell_1$. So, the identity kernel is not stable even if each kernel
section is stable since it contains only one non-null element.
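A concrete sequence showing that $\ell_2$ is not contained in $\ell_1$ is the harmonic sequence $g_k = 1/k$. It is square summable but not absolutely summable:

$$\sum_{k=1}^{\infty} g_k^2 = \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty, \qquad \sum_{k=1}^{\infty} |g_k| = \sum_{k=1}^{\infty} \frac{1}{k} = \infty.$$

Hence $g$ belongs to the RKHS induced by the identity kernel but is not the impulse response of a BIBO stable system.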

The following fundamental result can be found in a more general form in [16] and
gives the desired characterization of kernel stability. Perhaps not surprisingly, we
will see that the key test spaces are $\ell_\infty$, which contains bounded sequences in discrete
time, and $\mathcal{L}_\infty$, which contains essentially bounded functions in continuous time. The
proof is reported in Sect. 7.7.3.


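Theorem 7.5 (Necessary and sufficient condition for kernel stability, based on [16,
32, 73]) Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive semidefinite kernel with $\mathcal{X} = \mathbb{N}$ or
$\mathcal{X} = \mathbb{R}^+$. Then,

• one has

$$\mathcal{H} \subset \ell_1 \iff \sum_{s=1}^{\infty} \left| \sum_{t=1}^{\infty} K(s,t)\, l_t \right| < \infty, \;\; \forall l \in \ell_\infty \tag{7.56}$$

for the discrete-time case where $\mathcal{X} = \mathbb{N}$;
• one has

$$\mathcal{H} \subset \mathcal{L}_1 \iff \int_0^{\infty} \left| \int_0^{\infty} K(s,t)\, l(t)\, dt \right| ds < \infty, \;\; \forall l \in \mathcal{L}_\infty \tag{7.57}$$

for the continuous-time case where $\mathcal{X} = \mathbb{R}^+$.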


Figure 7.10 illustrates the meaning of Theorem 7.5 by resorting to a simple system
theory argument. In particular, a kernel can be seen as an acausal linear time-varying
system. In discrete time it induces the following input-output relationship

$$y_i = \sum_{j=1}^{\infty} K_i(j)\, u_j, \quad i = 1, 2, \ldots, \tag{7.58}$$

where $K_i(j) = K(i, j)$, while $u_i$ and $y_i$ denote the system input and output at
instant $i$. Then, the RKHS induced by $K$ is stable iff system (7.58) maps every
bounded input $\{u_i\}_{i=1}^{\infty}$ into a summable output $\{y_i\}_{i=1}^{\infty}$. Abusing notation, we can
also see $K$ as an infinite-dimensional matrix with $(i, j)$-entry given by $K_i(j)$, with $u$
and $y$ infinite-dimensional column vectors. Then, using ordinary algebra notation to
handle these objects, the input-output relationship becomes $y = Ku$ and the stability
condition is

$$\mathcal{H} \subset \ell_1 \iff Ku \in \ell_1 \;\; \forall u \in \ell_\infty.$$

Fig. 7.10 System theoretic interpretation of RKHS stability. The kernel $K$ is associated to an acausal
linear system. In discrete time, the input-output relationship is given by $y_i = \sum_{j=1}^{\infty} K_i(j) u_j$. Then,
$K$ is stable iff every bounded input $u$ is mapped into a summable output $y$


In Theorem 7.5, it is immediate to see that including the constraint $-1 \le l_t \le 1 \;\forall t$
on the test functions does not have any influence on the stability test. With this
constraint, one has

$$\left| \sum_{t=1}^{\infty} K(s,t)\, l_t \right| \le \sum_{t=1}^{\infty} |K(s,t)| \quad \text{and} \quad \left| \int_0^{\infty} K(s,t)\, l(t)\, dt \right| \le \int_0^{\infty} |K(s,t)|\, dt.$$

The following result is then an immediate corollary of Theorem 7.5 obtained by
exploiting the above inequalities. It states that absolute summability is a sufficient
condition for a kernel to be stable.


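Corollary 7.1 (based on [16, 32, 73]) Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive
semidefinite kernel with $\mathcal{X} = \mathbb{N}$ or $\mathcal{X} = \mathbb{R}^+$. Then,

• one has

$$\mathcal{H} \subset \ell_1 \impliedby \sum_{s=1}^{\infty} \sum_{t=1}^{\infty} |K(s,t)| < \infty \tag{7.59}$$

for the discrete-time case where $\mathcal{X} = \mathbb{N}$;
• one has

$$\mathcal{H} \subset \mathcal{L}_1 \impliedby \int_0^{\infty} \int_0^{\infty} |K(s,t)|\, dt\, ds < \infty \tag{7.60}$$

for the continuous-time case where $\mathcal{X} = \mathbb{R}^+$.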


Finally, consider the class of nonnegative-valued kernels $K^+$, i.e., satisfying
$K^+(s,t) \ge 0 \;\forall s, t$. If a kernel is stable, using as test function $l(t) = 1 \;\forall t$, one must
have

$$\sum_{s=1}^{\infty} \left| \sum_{t=1}^{\infty} K^+(s,t)\, l_t \right| = \sum_{s=1}^{\infty} \sum_{t=1}^{\infty} K^+(s,t) < \infty$$

in discrete time, and

$$\int_0^{\infty} \left| \int_0^{\infty} K^+(s,t)\, l(t)\, dt \right| ds = \int_0^{\infty} \int_0^{\infty} K^+(s,t)\, dt\, ds < \infty$$

in continuous time. So, for nonnegative-valued kernels, stability implies (absolute)
summability of the kernel. But, since we have seen in Corollary 7.1 that absolute
summability implies stability, the following result holds.


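Corollary 7.2 (based on [16, 32, 73]) Let $K^+ : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive
semidefinite and nonnegative-valued kernel with $\mathcal{X} = \mathbb{N}$ or $\mathcal{X} = \mathbb{R}^+$. Then,

• one has

$$\mathcal{H} \subset \ell_1 \iff \sum_{s=1}^{\infty} \sum_{t=1}^{\infty} K^+(s,t) < \infty \tag{7.61}$$

for the discrete-time case where $\mathcal{X} = \mathbb{N}$;
• one has

$$\mathcal{H} \subset \mathcal{L}_1 \iff \int_0^{\infty} \int_0^{\infty} K^+(s,t)\, dt\, ds < \infty \tag{7.62}$$

for the continuous-time case where $\mathcal{X} = \mathbb{R}^+$.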


As an example, we can now show that the Gaussian kernel (6.43) defined, e.g.,
over $\mathbb{R}^+ \times \mathbb{R}^+$ is not stable. In fact, it is nonnegative valued and one has

$$\int_0^{\infty} \int_0^{\infty} \exp\left(-(s-t)^2/\rho\right) ds\, dt = \infty \quad \forall \rho.$$

The same holds for the spline kernels (6.45) extended to $\mathbb{R}^+ \times \mathbb{R}^+$ and also for
translation invariant kernels introduced in Example 6.12, as, e.g., proved in [32] using
the Schoenberg representation theorem. Hence, none of these models is suited for
stable impulse response estimation.
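The summability test of Corollary 7.2 can be explored numerically in discrete time by summing $|K(s,t)|$ over growing index sets $s, t = 1, \ldots, d$. The sketch below uses a discrete-time analogue of the Gaussian kernel rather than the continuous-time kernel (6.43). The helper `abs_sum` and the values of `alpha` and `rho` are illustrative assumptions. For the first-order stable spline kernel, exactly $2k - 1$ index pairs have $\max(s,t) = k$, so the full sum is $\sum_{k \ge 1} (2k-1)\alpha^k = \alpha(1+\alpha)/(1-\alpha)^2$.

```python
import numpy as np

def abs_sum(kernel, d):
    """Sum of |K(s, t)| over the truncated index set s, t = 1..d."""
    idx = np.arange(1, d + 1)
    return float(np.abs(kernel(idx[:, None], idx[None, :])).sum())

alpha, rho = 0.9, 10.0                               # illustrative values

def tc(s, t):
    """First-order stable spline kernel alpha**max(s, t)."""
    return alpha ** np.maximum(s, t)

def gauss(s, t):
    """Discrete-time analogue of the Gaussian kernel exp(-(s - t)^2 / rho)."""
    return np.exp(-((s - t) ** 2) / rho)

sizes = [100, 400, 1600]
tc_sums = [abs_sum(tc, d) for d in sizes]
gauss_sums = [abs_sum(gauss, d) for d in sizes]
tc_limit = alpha * (1 + alpha) / (1 - alpha) ** 2    # closed-form value of the full sum

for d, a, b in zip(sizes, tc_sums, gauss_sums):
    print(f"d={d:5d}  stable spline: {a:10.4f}  Gaussian: {b:10.1f}")
print("stable spline limit:", tc_limit)
```

The partial sums of the stable spline kernel settle at the closed-form limit, while those of the Gaussian kernel grow roughly linearly with $d$, consistent with the kernel being unstable. A finite experiment of this kind only suggests the behaviour and does not replace the proof.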


Remark 7.5 Any unstable kernel can be made stable simply by truncation. More
specifically, let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be an unstable kernel with $\mathcal{X} = \mathbb{N}$ or $\mathcal{X} = \mathbb{R}^+$.
Then, by setting $K(s, t) = 0$ for $s, t > T$ for any given $T \in \mathcal{X}$, a stable kernel is
obtained. Care should however be taken when a FIR model is obtained through
this operation. In fact, consider, e.g., the use of the cubic spline or Gaussian kernel in
the estimation problem depicted in Fig. 7.6, setting $T$ equal to 20 or 50. Even after
truncation, such models would not give good performance: the undue oscillations
affecting the estimates in the top and middle panels of Fig. 7.6 would still be present.
The reason is that these two kernels do not encode the information that the variability
of the impulse response decreases as time progresses, as already discussed using
the Bayesian interpretation of regularization.

7.3.2 Inclusions of Reproducing Kernel Hilbert Spaces
in More General Lebesgue Spaces ⋆


We now discuss the conditions for a RKHS to be contained in the spaces $\mathcal{L}_p^\mu$ equipped
with a generic measure $\mu$. The following analysis will then include both the space
$\mathcal{L}_1$ (considered before with the Lebesgue measure) and $\ell_1$ as special cases obtained
with $p = 1$. First, we need the following definition.

Definition 7.2 (based on [16]) Let $1 \le p \le \infty$ and $q = \frac{p}{p-1}$ with the convention
$\frac{p}{p-1} = \infty$ if $p = 1$ and $\frac{p}{p-1} = 1$ if $p = \infty$. Moreover, let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a
positive semidefinite kernel. Then, the kernel $K$ is said to be $q$-bounded if

1. the kernel section $K_s \in \mathcal{L}_p^\mu$ for almost all $s \in \mathcal{X}$, i.e., for every $s \in \mathcal{X}$ except
on a set of null measure w.r.t. $\mu$;
2. the function $\int_{\mathcal{X}} K(s,t)\, l(t)\, d\mu(t) \in \mathcal{L}_p^\mu$, $\forall l \in \mathcal{L}_q^\mu$.

The following theorem then gives the necessary and sufficient condition for the
$q$-boundedness of a kernel and is a special case of Proposition 4.2 in [16].


Theorem 7.6 (based on [16]) Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive semidefinite kernel
with $\mathcal{H}$ the induced RKHS. Then, $\mathcal{H}$ is a subspace of $\mathcal{L}_p^\mu$ if and only if $K$ is
$q$-bounded, i.e.,

$$\mathcal{H} \subset \mathcal{L}_p^\mu \iff K \text{ is } q\text{-bounded}.$$

Theorem 7.6 thus makes it possible to check whether a RKHS is contained in $\mathcal{L}_p^\mu$ by examining the
properties of the kernel. Interestingly, setting $p = 1$, which implies $q = \infty$, and taking $\mu$, e.g.,
as the Lebesgue measure, one can see that the concepts of stable and $\infty$-bounded
kernel are equivalent. Theorem 7.5 is then a special case of Theorem 7.6.


7.4 Further Insights into Stable Reproducing Kernel
Hilbert Spaces ⋆


In this section, we provide some additional insights into the structure of the stable
kernels and associated RKHSs. The analysis is focused on the discrete-time case
where the kernel $K$ can be seen as an infinite-dimensional matrix with the $(i, j)$-entries
denoted by $K_{ij}$. Thus, the function domain is the set of natural numbers $\mathbb{N}$
and the RKHS contains discrete-time impulse responses of causal systems.
As discussed after (7.58) when commenting on Fig. 7.10, the kernel $K$ can also be
associated with an acausal linear time-varying system, often called kernel operator in
the literature. It maps the infinite-dimensional input (sequence) $u$ into the
infinite-dimensional output $Ku$ whose $i$th component is $\sum_{j=1}^{\infty} K_{ij} u_j$. Two important kernel
operators will be considered. The first one maps $\ell_\infty$ into $\ell_1$ and is key for kernel
stability as pointed out in Theorem 7.5. The second one maps $\ell_2$ into $\ell_2$ itself and
will be important to discuss spectral decompositions of stable kernels.

7.4.1 Inclusions Between Notable Kernel Classes


To state some relationships between stable kernels and other fundamental classes,
we start by introducing some sets of RKHSs. Define

• the set $\mathcal{S}_s$ that contains all the stable RKHSs;
• the set $\mathcal{S}_1$ with all the RKHSs induced by absolutely summable kernels, i.e.,
satisfying $\sum_{ij} |K_{ij}| < +\infty$;
• the set $\mathcal{S}_{ft}$ of RKHSs induced by finite-trace kernels, i.e., satisfying
$\sum_{i} K_{ii} < +\infty$;
• the set $\mathcal{S}_2$ associated to squared summable kernels, i.e., satisfying
$\sum_{ij} K_{ij}^2 < +\infty$.

One then has the following result from [8] (see Sect. 7.7.4 for some details on its
proof).

Theorem 7.7 (based on [8]) It holds that

$$\mathcal{S}_1 \subset \mathcal{S}_s \subset \mathcal{S}_{ft} \subset \mathcal{S}_2. \tag{7.63}$$

Figure 7.11 gives a graphical description of Theorem 7.7 in terms of inclusions of
kernel classes. Its meaning is further discussed below.

In Corollary 7.1, we have seen that absolute summability is a sufficient condition
for kernel stability. The result $\mathcal{S}_1 \subset \mathcal{S}_s$ shows also that such an inclusion is strict.
Hence, one cannot conclude that a kernel is unstable from the sole failure of absolute
summability.

The fact that $\mathcal{S}_s \subset \mathcal{S}_{ft}$ means that the set of finite-trace kernels contains the stable
class. This inclusion is strict, hence the trace analysis can be used only to show that
a given RKHS is not contained in $\ell_1$. There are however interesting consequences
of this fact. Consider all the RKHSs induced by translation invariant kernels

$$K_{ij} = h(i - j),$$

where $h$ satisfies the positive semidefinite constraints. The trace of these kernels is
$\sum_i K_{ii} = \sum_i h(0)$ and it always diverges unless $h$ is the null function. So, all the
translation invariant kernels are unstable (as already mentioned after Corollary 7.2).
Other instability results also become immediately available. For instance, all the
kernels with diagonal elements satisfying $K_{ii} \propto i^{-\delta}$ are unstable if $\delta \le 1$.

Fig. 7.11 Inclusion properties of some important kernel classes


Finally, the strict inclusion $\mathcal{S}_{ft} \subset \mathcal{S}_2$ shows that the finite-trace test is more
powerful than a check of kernel squared summability.


7.4.2 Spectral Decomposition of Stable Kernels 


As discussed in Sect. 6.6.3 and in Remark 6.3, kernels can define rich function
spaces by (implicitly) mapping the space of the regressors into high-dimensional
feature spaces where linear estimators can be used. This makes it possible to reduce nonlinear
algorithms to linear ones even without knowing explicitly the feature map, i.e., without the exact
knowledge of which functions are encoded in the kernel. In particular, in Sect. 6.3,
we have seen that if the kernel admits the spectral representation

$$K(x, y) = \sum_{i=1}^{\infty} \zeta_i\, \rho_i(x)\, \rho_i(y), \tag{7.64}$$

then the $\rho_i(x)$ are the basis functions that span the RKHS induced by $K$. For
instance, the basis functions $\rho_1(x) = 1$, $\rho_2(x) = x$, $\rho_3(x) = x^2, \ldots$ describe
polynomial models which are, e.g., included up to a certain degree in the polynomial
kernel discussed in Sect. 6.6.4. Now, we will see that stable kernels always admit
an expansion of the type (7.64) with the $\rho_i$ forming a basis of $\ell_2$. The number of $\zeta_i$
different from zero then corresponds to the dimension of the induced RKHS.

Formally, it is now necessary to consider the operator induced by a stable kernel $K$
as a map from $\ell_2$ into $\ell_2$ itself. Again, it is useful to see $K$ as an infinite-dimensional
matrix so that we can think of $Kv$ as the result of the kernel operator applied to
$v \in \ell_2$. An operator is said to be compact if it maps any bounded sequence $\{v_i\}$
into a sequence $\{Kv_i\}$ from which a convergent subsequence can be extracted [85,
95]. From Theorem 7.7, we know that any stable kernel $K$ is finite trace and, hence,
squared summable. This fact ensures the compactness of the kernel operator, as
discussed in [8] and stated below.


Theorem 7.8 (based on [8]) Any operator induced by a stable kernel is self-adjoint,
positive semidefinite and compact as a map from $\ell_2$ into $\ell_2$ itself.


This result allows us to exploit the spectral theorem [35] to obtain an expansion
of $K$. Now, recall that spectral decompositions were discussed in Sect. 6.3 where
Mercer's theorem was also reported. Derivations of Mercer's theorem exploit the spectral
theorem and, as, e.g., in Theorem 6.9, they typically assume that the kernel domain
is compact, see also [86] for discussions and extensions. Indeed, the first formulations
consider continuous kernels on compact domains (proving also uniform convergence
of the expansion). However, the spectral theorem does not require the domain to be
compact and, when applied to discrete-time kernels on $\mathbb{N} \times \mathbb{N}$, it guarantees pointwise
convergence. It thus becomes the natural generalization of the decomposition of a
symmetric matrix in terms of eigenvalues and eigenvectors, initially discussed in the
finite-dimensional setting in Sect. 5.6 to link regularization and basis expansion. This
is summarized in the following proposition that holds by virtue of Theorem 7.8.


Proposition 7.2 (Representation of stable kernels, based on [8]) Assume that the
kernel $K$ is stable. Then, there always exists an orthonormal basis of $\ell_2$ composed
of eigenvectors $\{\rho_i\}$ of $K$ with corresponding eigenvalues $\{\zeta_i\}$, i.e.,

$$K \rho_i = \zeta_i \rho_i, \quad i = 1, 2, \ldots$$

In addition, the kernel admits the following expansion:

$$K_{xy} = \sum_{i=1}^{+\infty} \zeta_i\, \rho_i(x)\, \rho_i(y), \tag{7.65}$$

with $x, y \in \mathbb{N}$.

While the next subsection uses the above proposition to discuss the
representation of stable RKHSs, some numerical considerations regarding (7.65) are now
in order. From an algorithmic viewpoint, many efficient machine learning procedures
use truncated Mercer expansions to approximate the kernel, see [42, 52, 75, 93,
96] for discussions on their optimality in a stochastic framework. Applications to
system identification can be found in [15], where it is shown that a relatively small
number of eigenfunctions (w.r.t. the data set size) can approximate well the regularized
impulse response estimates. These works trace back to the so-called Nyström
method where an integral equation is replaced by finite-dimensional approximations
[5, 6]. However, obtaining the Mercer expansion (7.65) in closed form is often hard.
Fortunately, the $\ell_2$ basis and related eigenvalues of a stable RKHS can be numerically
recovered (with arbitrary precision w.r.t. the $\ell_2$ norm) through a sequence of
SVDs applied to truncated kernels [8]. Formally, let $K^{(d)}$ denote the $d \times d$ positive
semidefinite matrix obtained by retaining only the first $d$ rows and columns of $K$.
Let also $\rho_i^{(d)}$ and $\zeta_i^{(d)}$ be, respectively, the eigenvectors of $K^{(d)}$, seen as elements of
$\ell_2$ with a tail of zeros, and the eigenvalues returned by the SVD of $K^{(d)}$. Assume,
for simplicity, single multiplicity of each $\zeta_i$. Then, for any $i$, as $d$ grows to $\infty$ one
has

$$\zeta_i^{(d)} \to \zeta_i \tag{7.66a}$$

$$\|\rho_i^{(d)} - \rho_i\|_2 \to 0, \tag{7.66b}$$

where $\|\cdot\|_2$ is the $\ell_2$ norm.

Fig. 7.12 Expansion of the first-order discrete-time stable spline kernel $K_{xy} = \alpha^{\max(x,y)}$ with
$\alpha = 0.99$: eigenfunctions $\rho_i(x)$ orthogonal in $\ell_2$ for $i = 1, 2, 8$ (left panel, samples are linearly
interpolated) and eigenvalues $\zeta_i$ (right)

In Fig. 7.12, we show some eigenvectors (left panel) and the first 100 eigenvalues
(right) of the stable spline kernel $K_{xy} = \alpha^{\max(x,y)}$ with $\alpha = 0.99$. Results
are obtained by applying SVDs to truncated kernels of different sizes and monitoring
convergence of eigenvectors and eigenvalues. The final outcome was obtained with
$d = 2000$.
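A minimal sketch of this kind of convergence monitoring is given below, under the following assumptions: the truncated kernels are diagonalized with `numpy.linalg.eigh`, which for a symmetric positive semidefinite matrix returns the same eigenvalues and eigenvectors as the SVD, and only the truncation sizes 500, 1000 and 2000 are compared. The function names are illustrative and this is not the code that produced Fig. 7.12.

```python
import numpy as np

def truncated_tc(d, alpha):
    """d x d truncation of the stable spline kernel K_xy = alpha**max(x, y)."""
    idx = np.arange(1, d + 1)
    return alpha ** np.maximum.outer(idx, idx)

def leading_eigs(d, alpha, m=5):
    """The m largest eigenvalues and their eigenvectors for the truncated kernel."""
    vals, vecs = np.linalg.eigh(truncated_tc(d, alpha))   # ascending order
    return vals[::-1][:m], vecs[:, ::-1][:, :m]

alpha = 0.99
sizes = [500, 1000, 2000]
eigs = {d: leading_eigs(d, alpha) for d in sizes}
for d in sizes:
    print(d, np.round(eigs[d][0], 4))

# Compare the first eigenvector at d = 1000, padded with a tail of zeros,
# with the one at d = 2000; abs() removes the sign ambiguity.
v_small = np.zeros(2000)
v_small[:1000] = eigs[1000][1][:, 0]
overlap = abs(v_small @ eigs[2000][1][:, 0])
print("overlap of the first eigenvectors:", overlap)
```

The leading eigenvalues are nondecreasing in $d$, since each truncated matrix is a principal submatrix of the next one, and they stabilize as $d$ grows. An overlap close to one indicates that the zero-padded eigenvector has converged in the $\ell_2$ norm, as in (7.66b).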


7.4.3 Mercer Representations of Stable Reproducing Kernel 
Hilbert Spaces and of Regularized Estimators 


Now we exploit the representations of the RKHSs induced by a diagonalized kernel
as discussed in Theorems 6.10 and 6.13 (where compactness of the input space is not
even required). In view of Proposition 7.2, assuming for simplicity all the $\zeta_i$ different
from zero, one obtains that the RKHS associated to a stable $K$ always admits the
representation

$$\mathcal{H} = \left\{ g \;\Big|\; g = \sum_{i=1}^{\infty} a_i \rho_i \;\; \text{s.t.} \;\; \sum_{i=1}^{\infty} \frac{a_i^2}{\zeta_i} < +\infty \right\}, \tag{7.67}$$

where the $\rho_i$ are the eigenvectors of $K$ forming an orthonormal basis of $\ell_2$.¹ If
$g = \sum_{i=1}^{\infty} a_i \rho_i$, one also has

$$\|g\|_{\mathcal{H}}^2 = \sum_{i=1}^{\infty} \frac{a_i^2}{\zeta_i}. \tag{7.68}$$

The fact that any stable RKHS is generated by an $\ell_2$ basis gives also a clear
connection with the important impulse response estimators which adopt orthonormal
functions, e.g., the Laguerre functions illustrated in Fig. 7.3 [46, 91, 92]. A classical
approach used in the literature is to introduce the model $g = \sum_i a_i \rho_i$ and then to use
linear least squares to determine the expansion coefficients $a_i$. In particular, let $L_k[g]$
be the system output, i.e., the convolution between the known input and $g$ evaluated
at the time instant $t_k$. Then, the impulse response estimate is

$$\hat{g} = \sum_{i=1}^{d} \hat{a}_i \rho_i \tag{7.69a}$$

$$\{\hat{a}_i\}_{i=1}^{d} = \arg\min_{\{a_i\}_{i=1}^{d}} \; \sum_{k=1}^{N} \left( y(t_k) - L_k\left[\sum_{i=1}^{d} a_i \rho_i\right] \right)^2, \tag{7.69b}$$

where $d$ determines model complexity and is typically selected using AIC or
cross-validation (CV) as discussed in Chap. 2.

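In view of (7.67) and (7.68), the regularized estimator (7.10), equipped with a
stable RKHS, is equivalent to

$$\hat{g} = \sum_{i=1}^{\infty} \hat{a}_i \rho_i \tag{7.70a}$$

$$\{\hat{a}_i\}_{i=1}^{\infty} = \arg\min_{\{a_i\}_{i=1}^{\infty}} \sum_{t=1}^{N} \Big( y(t) - L_t\Big[\sum_{i=1}^{\infty} a_i \rho_i\Big] \Big)^2 + \gamma \sum_{i=1}^{\infty} \frac{a_i^2}{\zeta_i}. \tag{7.70b}$$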

¹ In (7.67), we have assumed that all the kernel eigenvalues are strictly positive so that $\mathcal{H}$ is infinite
dimensional. If some $\zeta_i$ is null, $\mathcal{H}$ is spanned only by the eigenvectors associated to the non-null eigenvalues.
If only a finite number of $\zeta_i$ is different from zero, K is finite rank and $\mathcal{H}$ is finite dimensional.
A notable case is that of the RKHSs induced by truncated kernels, i.e., such that there exists d
such that $K_{ij} = 0 \ \forall i > d$. As we have seen, kernels of this kind induce finite-dimensional RKHSs
containing FIR systems of order d.
This result is connected with the kernel trick discussed in Remark 6.3 and shows
that regularized least squares in a stable (infinite-dimensional) RKHS always models
impulse responses using an $\ell_2$ orthonormal basis, as in the classical works on linear
system identification. But the key difference between (7.69) and (7.70) is that
complexity is no longer controlled by the model order because d is set to $\infty$. Complexity
instead depends on the regularization parameter $\gamma$ (and possibly also on other kernel
parameters) that balances the data fit and the penalty term. The latter induces
stability by using the kernel eigenvalues $\zeta_i$ to constrain the rate at which the
expansion coefficients decay to zero.
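As a hedged numerical illustration of this equivalence, the sketch below solves a truncated version of (7.70) in the eigenbasis of the stable spline kernel and compares it with the kernel-based (representer) form of the same estimator. The truncation order n, the simulated data and all variable names are assumptions made for the example. The change of variables $a_i = \sqrt{\zeta_i}\, b_i$ is used only to avoid dividing by very small eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, alpha, gamma = 40, 200, 0.8, 0.5   # truncation, data size, kernel decay, regularization

# Truncated stable spline kernel and its decomposition K = P diag(zeta) P^T.
idx = np.arange(1, n + 1)
K = alpha ** np.maximum.outer(idx, idx)
zeta, P = np.linalg.eigh(K)
zeta = np.clip(zeta, 0.0, None)          # guard against tiny negative round-off

# Simulated data: y(t) = sum_k g0(k) u(t-k) + e(t), with u(t) = 0 for t < 1.
u = rng.standard_normal(N)
g0 = 0.7 ** idx
Phi = np.zeros((N, n))
for k in range(1, n + 1):
    Phi[k:, k - 1] = u[: N - k]          # column k holds u(t-k), t = 1..N
y = Phi @ g0 + 0.1 * rng.standard_normal(N)

# Mercer form: g = sum_i a_i rho_i with penalty gamma * sum_i a_i^2 / zeta_i.
# With a_i = sqrt(zeta_i) b_i the penalty becomes gamma * ||b||^2.
A = (Phi @ P) * np.sqrt(zeta)
b = np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ y)
g_mercer = P @ (np.sqrt(zeta) * b)

# Kernel form: g = K Phi^T (Phi K Phi^T + gamma I)^{-1} y.
c = np.linalg.solve(Phi @ K @ Phi.T + gamma * np.eye(N), y)
g_kernel = K @ Phi.T @ c

print(np.max(np.abs(g_mercer - g_kernel)))
```

The two estimates coincide up to round-off, and no model order is selected: only $\gamma$ and $\alpha$ control complexity.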


7.4.4 Necessary and Sufficient Stability Condition Using 
Kernel Eigenvectors and Eigenvalues 


We have seen that a fruitful way to design a regularized estimator for linear system
identification is to introduce a kernel by specifying its entries $K_{ij}$. This modelling
technique translates our expected features of an impulse response into kernel
properties, e.g., smooth exponential decay as described by stable spline, TC and DC
kernels. This route exploits the kernel trick, i.e., the implicit encoding of the basis
functions. In some circumstances, it could be useful to build a kernel starting from the
design of eigenfunctions $\rho_i$ and eigenvalues $\zeta_i$. A notable example is given by the
(already cited) Laguerre or Kautz functions that belong to the more general class of
Takenaka–Malmquist orthogonal basis functions [46]. They can be useful to describe
oscillatory behavior or the presence of fast/slow poles.

Since any stable kernel can be associated with an $\ell_2$ basis, the following fundamental
problem then arises. Given an orthonormal basis $\{\rho_i\}$ of $\ell_2$, for example, of the
Takenaka–Malmquist type, what are the conditions on the eigenvalues $\zeta_i$ ensuring
stability of $K_{xy} = \sum_{i=1}^{\infty} \zeta_i \rho_i(x) \rho_i(y)$? The answer is in the following result derived
from [8] that reports the necessary and sufficient condition (the proof is given in
Sect. 7.7.5).


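Theorem 7.9 (RKHS stability using Mercer expansions, based on [8]) Let $\mathcal{H}$ be
the RKHS induced by K with

$$K_{xy} = \sum_{i=1}^{+\infty} \zeta_i \rho_i(x) \rho_i(y),$$

where the $\{\rho_i\}$ form an orthonormal basis of $\ell_2$. Let also

$$\mathcal{U}_\infty = \big\{ u \in \ell_\infty : |u(i)| = 1, \ \forall i \ge 1 \big\}.$$

Then, one has

$$\mathcal{H} \subset \ell_1 \iff \sup_{u \in \mathcal{U}_\infty} \sum_i \zeta_i \langle \rho_i, u \rangle_2^2 < +\infty, \tag{7.71}$$

where $\langle \cdot, \cdot \rangle_2$ is the inner product in $\ell_2$.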


Thus, clearly, there is no stability if one function $\rho_i$ associated to $\zeta_i > 0$ does not
belong to $\ell_1$. In fact, one can choose u containing the signs of the components of
$\rho_i$ and this leads to $\langle \rho_i, u \rangle_2 = +\infty$. Nothing is instead required for the eigenvectors
associated to $\zeta_i = 0$. Theorem 7.9 also permits deriving the following sufficient
stability condition.


Corollary 7.3 (based on [8]) Let $\mathcal{H}$ be the RKHS induced by the kernel $K_{xy} =
\sum_{i=1}^{\infty} \zeta_i \rho_i(x) \rho_i(y)$ with $\{\rho_i\}$ an orthonormal basis of $\ell_2$. Then, it holds that

$$\mathcal{H} \subset \ell_1 \impliedby \sum_i \zeta_i \|\rho_i\|_1^2 < +\infty. \tag{7.72}$$

Furthermore, such a condition also implies kernel absolute summability and, hence,
it is not necessary for RKHS stability.
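The quantities entering (7.71) and (7.72) can be compared numerically on a truncated kernel. The sketch below is only an illustration under assumed values (d = 300, $\alpha = 0.9$, stable spline kernel). Since this kernel has positive entries, the choice u = (1, ..., 1) attains the supremum of $u^T K u$ over sign vectors, and this supremum equals the sum of the absolute values of the kernel entries.

```python
import numpy as np

d, alpha = 300, 0.9                       # assumed truncation size and decay
idx = np.arange(1, d + 1)
K = alpha ** np.maximum.outer(idx, idx)   # truncated stable spline kernel
zeta, rho = np.linalg.eigh(K)             # columns of rho are the eigenvectors
zeta = np.clip(zeta, 0.0, None)

ones = np.ones(d)
quad_form = ones @ K @ ones                               # u^T K u with u = (1, ..., 1)
mercer_form = np.sum(zeta * (rho.T @ ones) ** 2)          # sum_i zeta_i <rho_i, u>^2
abs_sum = np.abs(K).sum()                                 # kernel absolute summability
sufficient = np.sum(zeta * np.abs(rho).sum(axis=0) ** 2)  # sum_i zeta_i ||rho_i||_1^2

print(quad_form, mercer_form, abs_sum, sufficient)
```

The first three numbers coincide, while the last one is an upper bound. This is the finite-dimensional counterpart of the statement that the condition in (7.72) implies absolute summability and is therefore only sufficient.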


It is easy to exploit the stability condition (7.72) to design models of stable impulse
responses starting from an $\ell_2$ basis. Let us reconsider, e.g., Laguerre or Kautz basis
functions $\{\rho_i\}$ to build the impulse response model

$$g = \sum_{i=1}^{\infty} a_i \rho_i.$$

To exploit (7.70), one has to define stability constraints on the expansion coefficients
$a_i$. This corresponds to defining $\zeta_i$ in such a way that the regularizer

$$\sum_{i=1}^{\infty} \frac{a_i^2}{\zeta_i}$$

enforces absolute summability of g. Laguerre and Kautz models belong to the
Takenaka–Malmquist class of functions $\rho_i$ that all satisfy

$$\|\rho_i\|_1 \le M i,$$

with M a constant independent of i [46]. Then, Corollary 7.3 ensures that the choice

$$\zeta_i \propto i^{-\nu}, \quad \nu > 2$$

includes the stability constraint for the entire Takenaka–Malmquist class.

Let us now consider the class of orthonormal basis functions $\rho_i$ all contained in
a ball of $\ell_1$. Then, the necessary and sufficient stability condition takes an
especially simple form, as the following result shows.


Corollary 7.4 (based on [8]) Let $\mathcal{H}$ be the RKHS induced by the kernel $K_{xy} =
\sum_i \zeta_i \rho_i(x) \rho_i(y)$ with $\{\rho_i\}$ an orthonormal basis of $\ell_2$ and $\|\rho_i\|_1 \le M < +\infty$ if
$\zeta_i > 0$, with M not dependent on i. Then, one has

$$\mathcal{H} \subset \ell_1 \iff \sum_i \zeta_i < +\infty. \tag{7.73}$$

Fig. 7.13 Inclusion properties of some important kernel classes in terms of Mercer expansions.
This representation is the dual of that reported in Fig. 7.11 and defines kernel sets through properties
of the kernel eigenvectors $\rho_i$, forming an orthonormal basis in $\ell_2$, and of the corresponding kernel
eigenvalues $\zeta_i$. The condition $\sum_i \zeta_i \|\rho_i\|_1^2 < \infty$ is the most restrictive since it implies kernel absolute
summability. The necessary and sufficient condition for stability is $\sup_{u \in \mathcal{U}_\infty} \sum_i \zeta_i \langle \rho_i, u \rangle_2^2 < \infty$.
Finally, $\sum_i \zeta_i < \infty$ and $\sum_i \zeta_i^2 < \infty$ are exactly the conditions for a kernel to be finite trace and
squared summable, respectively


Finally, Fig. 7.13 illustrates graphically all the stability results here obtained start- 
ing from Mercer expansions. 


7.5 Minimax Properties of the Stable Spline Estimator ⋆


In this section, we will derive non-asymptotic upper bounds on the MSE of the
regularized IIR estimator (7.10) valid for all the exponentially stable discrete-time
systems whose poles belong to the complex circle of radius $\rho$. The obtained bounds can
be evaluated before any data is observed. Results of this kind give insight into the
so-called sample complexity, i.e., the number of measurements needed to achieve a
certain accuracy on impulse response reconstruction. This is an attractive feature even
if, since the bounds need to hold for all the models falling in a particular class, they
are often quite loose for the particular dynamic system at hand. However, they have
considerable theoretical value since they also permit assessing the quality of (7.10) through
nonparametric minimax concepts. Such a setting considers the worst case inside an
infinite-dimensional class and has been widely studied in nonparametric regression
and density estimation [88]. In particular, the obtained bounds will lead to conditions
which ensure the optimality in order, i.e., the best convergence rate of (7.10) in the
minimax sense. We will derive them by considering system inputs given by white
noises and using the TC/stable spline kernel (7.15) as regularizer. The important
dependence between the convergence rate of (7.10) to the true impulse response, the
stability kernel parameter $\alpha$ and the stability radius $\rho$ will be elucidated.


7.5.1 Data Generator and Minimax Optimality 


As in the previous part of the chapter, we use $g^0$ to denote the impulse response of
a discrete-time linear system. The measurements are generated as follows:

$$y(t) = \sum_{k=1}^{\infty} g^0(k)\, u(t-k) + e(t), \tag{7.74}$$

where $g^0(k)$ are the impulse response coefficients. We will always assume $g^0$ to be a
deterministic and exponentially stable impulse response, while the input u and the
noise e are stochastic as specified below.


Assumption 7.10 The impulse response $g^0$ belongs to the following set:

$$\mathcal{S}(\rho, L) = \big\{ g : |g(k)| \le L \rho^k \big\}, \quad 0 < \rho < 1. \tag{7.75}$$

The system input and the noise are discrete-time stochastic processes. One has that
$\{u(t)\}_{t \in \mathbb{Z}}$ are independent and identically distributed (i.i.d.) zero-mean random
variables with

$$\mathcal{E}[u(t)^2] = \sigma_u^2, \quad |u(t)| \le C_u < \infty. \tag{7.76}$$

Finally, $\{e(t)\}_{t \in \mathbb{Z}}$ are independent random variables, independent of $\{u(t)\}_{t \in \mathbb{Z}}$, with

$$\mathcal{E}[e(t)] = 0, \quad \mathcal{E}[e(t)^2] \le \sigma^2. \tag{7.77}$$

The available measurements are

$$\mathcal{D}_T = \{u(1), \ldots, u(N), y(1), \ldots, y(N)\}, \tag{7.78}$$

where N is the data set size.
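A data generator compatible with Assumption 7.10 can be sketched as follows. The specific choices are assumptions made for illustration only: a uniform input on $[-C_u, C_u]$ (so that $\sigma_u^2 = C_u^2/3$), Gaussian noise, an impulse response touching the envelope $L\rho^k$ with random signs, and a truncation of the infinite sum in (7.74) after `n_taps` terms.

```python
import numpy as np

def simulate(N, rho=0.8, L=1.0, C_u=1.0, sigma=0.1, n_taps=200, seed=0):
    """Draw (g0, u(1..N), y(1..N)) from a truncated version of (7.74)."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, n_taps + 1)
    # Impulse response on the boundary of the set (7.75): |g0(k)| = L rho^k.
    g0 = L * rho ** k * rng.choice([-1.0, 1.0], size=n_taps)
    # Bounded i.i.d. zero-mean input; array index j corresponds to time j - n_taps + 1.
    u = rng.uniform(-C_u, C_u, size=N + n_taps)
    e = sigma * rng.standard_normal(N)
    # y(t) = sum_{k=1}^{n_taps} g0(k) u(t-k) + e(t), t = 1..N.
    y = np.array([g0 @ u[t + n_taps - 1 - k] for t in range(1, N + 1)]) + e
    return g0, u[n_taps:], y

g0, u, y = simulate(N=500)
print(g0[:3], u.shape, y.shape)
```

With $\rho = 0.8$ and 200 taps, the neglected tail is of order $\rho^{200}$ and is negligible in double precision.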

The quality of an impulse response estimator $\hat{g}$, function of $\mathcal{D}_T$, will be measured
by computing the estimation error $\mathcal{E}\|g^0 - \hat{g}\|_2$, where $\|\cdot\|_2$ is the norm in the space
$\ell_2$ of squared summable sequences. Note that the expectation is taken w.r.t. the
randomness of the system input and the measurement noise. The worst-case error
over the family $\mathcal{S}$ of exponentially stable systems reported in (7.75) will also be
considered. In particular, the uniform $\ell_2$-risk of $\hat{g}$ is

$$\sup_{g^0 \in \mathcal{S}} \mathcal{E}\|g^0 - \hat{g}\|_2.$$

An estimator $g^*$ is then said to be minimax if the following equality holds for any
data set size N:

$$\sup_{g^0 \in \mathcal{S}} \mathcal{E}\|g^0 - g^*\|_2 = \inf_{\hat{g}} \sup_{g^0 \in \mathcal{S}} \mathcal{E}\|g^0 - \hat{g}\|_2,$$

meaning that $g^*$ minimizes the error w.r.t. the worst-case scenario. Building such
an estimator is in general very difficult. For this reason, it is often convenient
to consider just the asymptotic behaviour, introducing the concept of optimality in
order. Specifically, an estimator $\hat{g}$ is optimal in order if

$$\sup_{g^0 \in \mathcal{S}} \mathcal{E}\|g^0 - \hat{g}\|_2 \le C_N \sup_{g^0 \in \mathcal{S}} \mathcal{E}\|g^0 - g^*\|_2,$$

where $C_N$ is a function of the data set size satisfying $\sup_N C_N < \infty$ and $g^*$ is
minimax. In our linear system identification setting, optimality in order thus ensures
that, as N grows to infinity, the convergence rate of $\hat{g}$ to the true impulse response
$g^0$ cannot be improved by any other system identification procedure in the minimax
sense.


7.5.2 Stable Spline Estimator 


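As anticipated, our study is focused on the following regularized estimator:

$$\hat{g} = \arg\min_{g \in \mathcal{H}} \sum_{t=1}^{N} \Big( y(t) - \sum_{k=1}^{\infty} g(k)\, u(t-k) \Big)^2 + \gamma \|g\|_{\mathcal{H}}^2, \tag{7.79}$$

equipped with the stable spline kernel

$$K(i, j) = \alpha^{\max(i,j)}, \quad 0 < \alpha < 1, \quad i, j \in \mathbb{N}. \tag{7.80}$$

For future developments, it is important to control complexity of (7.79) not only by
using the hyperparameters $\gamma$ and $\alpha$ but also through the dimension d of the following
subspace:

$$\mathcal{H}_d = \{ g \in \mathcal{H} \ \text{ s.t. } \ g(d+1) = g(d+2) = \cdots = 0 \}$$

over which optimization of the objective in (7.79) is performed. In particular, we will
consider the estimator

$$\hat{g}^d = \arg\min_{g \in \mathcal{H}_d} \sum_{t=1}^{N} \Big( y(t) - \sum_{k=1}^{d} g(k)\, u(t-k) \Big)^2 + \gamma \|g\|_{\mathcal{H}}^2 \tag{7.81}$$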




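and will study how N and the choice of $\gamma$, $\alpha$, d influence the estimation error and,
hence, the convergence rate. This will lead to complexity control rules that are a
hybrid of those seen in the classical and in the regularized framework. To obtain
this, first, we rewrite (7.81) in terms of regularized FIR estimation by exploiting the
structure of the stable spline norm (7.16) which shows that

$$g \in \mathcal{H}_d \implies \|g\|_{\mathcal{H}}^2 = \sum_{t=1}^{d-1} \frac{(g(t+1) - g(t))^2}{(1-\alpha)\alpha^t} + \frac{g(d)^2}{(1-\alpha)\alpha^d}. \tag{7.82}$$

Let us define the matrix

$$R = \frac{1}{1-\alpha} \begin{bmatrix}
\frac{1}{\alpha} & -\frac{1}{\alpha} & 0 & \cdots & 0 \\
-\frac{1}{\alpha} & \frac{1}{\alpha} + \frac{1}{\alpha^2} & -\frac{1}{\alpha^2} & \ddots & \vdots \\
0 & -\frac{1}{\alpha^2} & \frac{1}{\alpha^2} + \frac{1}{\alpha^3} & \ddots & 0 \\
\vdots & \ddots & \ddots & \ddots & -\frac{1}{\alpha^{d-1}} \\
0 & \cdots & 0 & -\frac{1}{\alpha^{d-1}} & \frac{1}{\alpha^{d-1}} + \frac{1}{\alpha^{d}}
\end{bmatrix} \tag{7.83}$$

and the regressors

$$\varphi_d(t) = \begin{bmatrix} u(t-1) \\ \vdots \\ u(t-d) \end{bmatrix}. \tag{7.84}$$

Now, one can easily see that the first d components of $\hat{g}^d$ in (7.81) are contained in
the vector

$$\arg\min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{N} \big( y(t) - \varphi_d(t)^T \theta \big)^2 + \gamma\, \theta^T R\, \theta. \tag{7.85}$$

Hence, we obtain

$$\hat{g}^d = (\hat{g}(1), \ldots, \hat{g}(d), 0, 0, \ldots) \tag{7.86}$$

where

$$\begin{bmatrix} \hat{g}(1) \\ \vdots \\ \hat{g}(d) \end{bmatrix} = \Big( \frac{1}{N} \sum_{t=1}^{N} \varphi_d(t) \varphi_d(t)^T + \frac{\gamma}{N} R \Big)^{-1} \frac{1}{N} \sum_{t=1}^{N} \varphi_d(t)\, y(t). \tag{7.87}$$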


In real applications, one cannot measure the inputs at all the time instants and our
data set $\mathcal{D}_T$ in (7.78) could contain only the inputs u(1), ..., u(N). So, differently
from what is postulated in the above equations, in practice the regressors are never
perfectly known. One solution is just to replace with zeros the unknown input values
$\{u(t)\}_{t<1}$ entering (7.84). Also under this model misspecification, all the results
introduced in the next sections still hold.
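The regularized FIR form of the estimator can be sketched in NumPy as follows. The function names, the simulated system and the parameter values are assumptions made for illustration. The penalty matrix is assembled directly from the quadratic form (7.82), and the unknown inputs before time 1 are replaced by zeros as discussed above.

```python
import numpy as np

def stable_spline_penalty(d, alpha):
    """Tridiagonal R such that theta^T R theta equals the norm in (7.82)."""
    w = (1.0 - alpha) * alpha ** np.arange(1, d + 1)   # w_t = (1 - alpha) alpha^t
    R = np.zeros((d, d))
    for t in range(d):
        R[t, t] += 1.0 / w[t]          # from (g(t+1) - g(t))^2 / w_t, or g(d)^2 / w_d
        if t + 1 < d:
            R[t + 1, t + 1] += 1.0 / w[t]
            R[t, t + 1] -= 1.0 / w[t]
            R[t + 1, t] -= 1.0 / w[t]
    return R

def stable_spline_fir(u, y, d, alpha, gamma):
    """First d coefficients of the estimate, computed as in (7.87)."""
    N = len(y)
    Phi = np.zeros((N, d))
    for k in range(1, d + 1):
        Phi[k:, k - 1] = u[: N - k]    # u(t-k); inputs before time 1 set to zero
    R = stable_spline_penalty(d, alpha)
    A = Phi.T @ Phi / N + (gamma / N) * R
    return np.linalg.solve(A, Phi.T @ y / N)

# Illustrative data (assumed system: g0(k) = 0.5^k, uniform input, Gaussian noise).
rng = np.random.default_rng(0)
N, d, alpha, gamma = 2000, 10, 0.8, 1.0
u = rng.uniform(-1.0, 1.0, size=N)
g0 = 0.5 ** np.arange(1, 31)
y = np.convolve(u, np.concatenate(([0.0], g0)))[:N] + 0.05 * rng.standard_normal(N)
g_hat = stable_spline_fir(u, y, d, alpha, gamma)
print(np.linalg.norm(g_hat - g0[:d]))
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse in (7.87), which matters because the entries of R grow like $\alpha^{-d}$.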
7.5.3 Bounds on the Estimation Error and Minimax
Properties


The following theorem will report non-asymptotic bounds that illustrate the
dependence of $\mathcal{E}\|g^0 - \hat{g}^d\|_2$ on the following three key variables:


• the FIR order d which determines the truncation error;

• the parameter $\alpha$ contained in the matrix R reported in (7.83) that establishes the
exponential decay of the estimated impulse response coefficients;

• the regularization parameter $\gamma$ which trades off the penalty defined by R and the
adherence to experimental data.


In addition, it gives conditions on $\alpha$ which ensure optimality in order if some
conditions on the stability radius $\rho$ entering (7.75) and on the FIR order d (function of
the data set size N) are fulfilled. Below, the notation O(1) indicates an absolute
constant, independent of N. Furthermore, given $x \in \mathbb{R}$, we use $\lfloor x \rfloor$ to indicate the
largest integer not larger than x. The following result then holds.


Theorem 7.11 (based on [74]) Let the FIR order d be defined by the following
function of the data set size N:

$$d^* = \left\lfloor \frac{\ln\big(N(1-\alpha)\sigma_u^2\big) - \ln(8\gamma)}{\ln(1/\alpha)} \right\rfloor \tag{7.88}$$

with N large enough to guarantee $d^* \ge 1$.
Then, under Assumption 7.10, the estimator (7.81) satisfies
Ellg — 8 l2 (7.89) 
ig d* o jd) 4Ly he 
<o J — +1] +> + f 
- ol N Ou Y N l-a N 
where 
~v d* ifa =p 
p : 
ial Vege. Orr. (7.90) 


a, 
Jaca (a) ifa<p 


Furthermore, if the measurement noise is Gaussian and $\sqrt{\alpha} > \rho$, the stable spline
estimator (7.81) is optimal in order.


To illustrate the meaning of Theorem 7.11, it is first useful to recall a result obtained
in [43] that relies on Fano's inequality. It shows that, if a dynamic system is fed
with white input and the measurement noise is Gaussian, the expected $\ell_2$ error of
any impulse response estimator cannot decay to zero faster than $\sqrt{\ln N / N}$ in a minimax
sense.
Theorem 7.12 (based on [43]) Let Assumption 7.10 hold and assume also that the
measurement noise is Gaussian. Then, if $\hat{g}$ is any impulse response estimator built
with $\mathcal{D}_T$, for N sufficiently large one has

$$\sup_{g^0 \in \mathcal{S}(\rho, L)} \mathcal{E}\|g^0 - \hat{g}\|_2 \ge O(1) \sqrt{\frac{\ln N}{N}}. \tag{7.91}$$


To illustrate the convergence rate of the stable spline estimator, first note that
the FIR dimension $d^*$ in (7.88) scales logarithmically with N. Apart from irrelevant
constants, one in fact has

$$d^* \sim \frac{\ln(N)}{\ln(1/\alpha)}. \tag{7.92}$$


We now consider the three terms on the r.h.s. of (7.89) with d = d*. Since 


Tada up e (7.93) 
— Past es an N na š 3 
N N P 


the first two terms decay to zero at least as $\sqrt{\ln N / N}$.


Regarding the third one, one has 


aN 
ie |e Be? 
~ 1 ifa >p (7.94) 


Nm ifa<p 


and this shows that the optimal convergence rate is obtained if $\alpha > \rho$ but the case
$\alpha < \rho$ can be critical. In particular, combining (7.89) with (7.93) and (7.94), the
following considerations arise:


• the convergence rate of the stable spline estimator (7.81) does not depend on $\gamma$ but
only on the relationship between the kernel parameter $\alpha$ and the stability radius $\rho$
defining the class of dynamic systems (7.75);

• using Theorem 7.12, one can see from (7.94) that if $\alpha < \rho$ the achievement of the
optimal rate is related to the term $N^{-\ln \rho / \ln \alpha}$ which appears as third term in (7.89).
The key condition is

$$\frac{\ln \rho}{\ln \alpha} > 0.5 \implies \sqrt{\alpha} > \rho.$$

This indeed corresponds to what was stated in the final part of Theorem 7.11: under
Gaussian noise the stable spline estimator is optimal in order if $\sqrt{\alpha}$ is an upper
bound on the stability radius $\rho$.
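The two quantities driving this discussion are easy to evaluate before any data is collected. The sketch below assumes that (7.88) reads $d^* = \lfloor (\ln(N(1-\alpha)\sigma_u^2) - \ln(8\gamma)) / \ln(1/\alpha) \rfloor$. The function names and the numerical values are hypothetical and only serve as an illustration.

```python
import math

def fir_order(N, alpha, sigma_u2, gamma):
    """d* as read from (7.88): floor of [ln(N (1 - alpha) sigma_u^2) - ln(8 gamma)] / ln(1 / alpha)."""
    num = math.log(N * (1.0 - alpha) * sigma_u2) - math.log(8.0 * gamma)
    return math.floor(num / math.log(1.0 / alpha))

def rate_exponent(alpha, rho):
    """Exponent ln(rho) / ln(alpha) of the term N^(-ln rho / ln alpha)."""
    return math.log(rho) / math.log(alpha)

def is_optimal_in_order(alpha, rho):
    """Condition sqrt(alpha) > rho from the final part of Theorem 7.11."""
    return math.sqrt(alpha) > rho

d_star = fir_order(N=1000, alpha=0.8, sigma_u2=1.0, gamma=1.0)
print(d_star)
for alpha in (0.5, 0.81, 0.9):
    print(alpha, rate_exponent(alpha, rho=0.9), is_optimal_in_order(alpha, rho=0.9))
```

For $\rho = 0.9$ the exponent crosses 0.5 exactly at $\alpha = \rho^2 = 0.81$, which is the threshold $\sqrt{\alpha} = \rho$.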


Relationships (7.93) and (7.94) clarify also what happens when the kernel includes
a too fast exponential decay rate, i.e., when $\sqrt{\alpha} < \rho$. In this case, the error goes to
zero as $N^{-\ln \rho / \ln \alpha}$, getting worse as $\sqrt{\alpha}$ drifts apart from $\rho$. Such a phenomenon has a simple
explanation. A too small $\alpha$ forces the impulse response estimate to decay to zero
also when the true impulse response coefficients are significantly different from zero.
This corresponds to a strong bias: a wrong amount of regularization is introduced
in the estimation process, hence compromising the convergence rate. This is also
graphically illustrated in Fig. 7.14 that plots the convergence rate $\ln \rho / \ln \alpha$ as a
function of $\sqrt{\alpha}$ for five different values of $\rho$.

Fig. 7.14 Convergence rate $\ln \rho / \ln \alpha$ of the stable spline estimator as a function of $\sqrt{\alpha}$ for $\sqrt{\alpha} < \rho$
with $\rho$ in the set {0.7, 0.8, 0.9, 0.95, 0.99}. When $\sqrt{\alpha} < \rho$ the estimation error converges to zero as
$N^{-\ln \rho / \ln \alpha}$. Instead, if $\sqrt{\alpha} > \rho$ the error decays as $\sqrt{\ln N / N}$, making the stable spline estimator optimal
in order when the measurement noise is Gaussian

The analysis thus shows how $\alpha$ plays a fundamental role in controlling impulse
response complexity and, hence, in establishing the properties of the regularized
estimator. This is not surprising also in view of the deep connection between the
decay rate and the degrees of freedom of the model. This was illustrated in Fig. 5.6
of Sect. 5.5.1 using the class of DC kernels which includes TC as a special case.


7.6 Further Topics and Advanced Reading 


The idea to handle linear system identification with regularization methods in the 
RKHS framework first appears in [72]. As already mentioned, the representer
theorems introduced in this chapter are special cases of that involving linear and bounded
functionals reported in the previous chapter, see Theorem 6.16. More general versions
of representer theorems with, e.g., more general loss functions and/or regularization 
terms can be found in, e.g., [33]. Similarly to the spline smoothing problem studied 
in Sect.6.6.7, it could be useful to enrich the regularized impulse response esti- 
mators here described with a parametric component. Of course, the corresponding 
regularized estimator will still have a closed-form finite-dimensional representation 
that depends on both the number of data N and the number of enriched parametric 
components, e.g., see [72, 90]. 

The stable spline kernel [72] and the diagonal correlated kernel [19] are the first 
two kernels introduced in the linear system identification literature. The stability of 
a kernel (or equivalently the stability of a RKHS) first appeared in [32, 73]. The
stability of a kernel is equivalent to the $\infty$-boundedness of the kernel, which is a
special case of the more general q-boundedness with $1 \le q \le \infty$ in [16]. The proof
in [16] for the sufficiency and necessity of the q-boundedness of a kernel is quite
involved and abstract. Theorem 7.5 is also discussed in [24], see also [76] where the
stability analysis exploits the output kernel. The optimal kernel that minimizes the 
mean squared error was studied in [19, 73]. As already discussed, unfortunately, the 
optimal kernel cannot be applied in practice because it depends on the true impulse 
response to be estimated. Nevertheless, it offers a guideline to design kernels for linear 
system identification and more general function estimation problems. Motivated by 
these findings, many stable kernels have been introduced over the years, e.g., [17, 
21, 77, 80, 97]. In particular, [17] proposed linear multiple kernels to handle systems 
with complicated dynamics, e.g., with distinct time constants and distinct resonant 
frequencies, and [77] further extended this idea and proposed “integral” versions of 
the stable spline kernels. To design kernels to embed more general prior knowledge, 
e.g., the overdamped/underdamped dynamics, common structure, etc., it is natural to 
divide the prior knowledge into different types and then develop systematic ways to 
design kernels accordingly, see [21, 80, 97]. In particular, the approaches proposed
in [21] are based on machine learning and system theory perspectives, those in
[80] rely on the maximum entropy principle, and the method proposed in [97] uses 
harmonic analysis. 

Along with the kernel design, many efforts have also been spent on “kernel anal- 
ysis”. In particular, many kernels can be given maximum entropy interpretations 
including the stable spline kernel, the diagonal correlated kernel and the more gen- 
eral simulation-induced kernel [14, 21, 23]. This can help to understand the prior 
knowledge embedded in the model. Many kernels have the Markov property e.g., 
[83]. Examples are the diagonal correlated kernel and some carefully designed sim- 
ulation induced kernels [21]. Exploring this property could help to design efficient 
implementations. As we have seen, the spectral analysis of kernels is often not
available in closed form, even if it can be numerically recovered, but exceptions include
the stable spline and the diagonal correlated kernel [20, 22, 72].

The hyperparameter tuning problem has been studied for a long time in the context
of function estimation problems from noisy observations, e.g., [83, 90]. The marginal
likelihood maximization method depends on the connection with the Bayesian
estimation of Gaussian processes, which was first studied in [51] in spline regression, see
also [41, 83, 90]. More discussions on its relation to Bayesian evidence and Occam's
razor principle can be found in e.g., [27, 60]. Stein's unbiased risk estimation method
is also known as the $C_p$ statistic [61]. The generalized cross-validation method was
first proposed in [28] and found to be rotation invariant in [44]. The problem can also
be tackled using full Bayes approaches relying on stochastic simulation techniques, 
e.g., Markov chain Monte Carlo [1, 39]. 

In the context of linear system identification, some theoretical results on the hyper- 
parameter estimation problem have been derived. In particular, it was shown in [4] 
that the marginal likelihood maximization method is consistent for diagonal kernels 
in terms of the mean square error and asymptotically minimizes a weighted mean 
square error for nondiagonal kernels. In [78], the robustness of the marginal likeli- 
hood maximization is analysed with the help of the excess degrees of freedom. It is 
further shown in [63, 64, 66] that Stein’s unbiased risk estimation as well as many 
cross-validation methods are asymptotically optimal in the sense of mean square 
error. In [4, 17, 94], the optimal hyperparameter of the marginal likelihood maxi- 
mization is shown to be sparse. By exploring such property it is possible to handle 
various structure detection problems in system identification like sparse dynamic 
network identification [17, 26]. Full Bayes approaches can be found, e.g., in [69]. 

As also recalled in the previous chapter, straightforward implementation of the
regularization method in the RKHS framework has computational complexity $O(N^3)$
and thus is prohibitive to apply when N is large. Many efficient approximation
methods have been proposed in machine learning, e.g., [53, 81, 82]. In the context of linear
system identification, there is another practical issue that must be noted in the imple- 
mentation: the ill-conditioning possibly arising from the use of stable kernels, which 
is unavoidable due to the nature of stability. Hence, extra care has to be taken when 
developing efficient implementations. Some approximation methods have been pro- 
posed to reduce the computational complexity and avoid numerical computation. 
The first one is to truncate the IIR at a suitable finite order n. Then, computational
complexity becomes $O(n^3)$ and one can also use the approach proposed in [18]
relying on some fundamental algebraic techniques and reliable matrix factorizations.
The other one is to truncate the infinite expansion of a kernel at a finite order l.
Then, computational complexity becomes $O(l^3)$, see [15]. See also [36] for efficient
kernel-based regularization implementation using Alternating Direction Method of 
Multipliers (ADMM). Another practical issue is the difficulty caused by local
minima. For kernels with few hyperparameters, e.g., the stable spline kernel
and the diagonal correlated kernel, this difficulty can be handled well using different
starting points or also some grid methods. For systems with complicated
dynamics, it is suggested to apply linear multiple kernels [17] since the corresponding
marginal likelihood maximization is a difference of convex programming problem 
and a stationary point can be found efficiently using sequential convex optimization 
technique, e.g., [48, 87]. 

We only considered single-input single-output linear systems in the chapter with 
white measurement noise. For multiple-input single-output linear systems, it is
natural to use multi-input impulse response models and then assume that the overall
system has a block diagonal kernel [73]. The regularization method can also be
extended to handle linear systems with colored noise, e.g., ARMAX models. One
can exploit the fact that such systems can be approximated arbitrarily well by
finite-order ARX models [57]. The problem thus becomes a special case of multiple-input
single-output systems where the regressors contain also past outputs [71]. This will
be also illustrated in Chap. 9. 

In practice, the data could be contaminated by outliers due to a failure in the 
measurement or transmission equipment, e.g., [56, Chap. 15]. In the presence of out- 
liers, it is suggested to use heavy-tailed distributions instead of the commonly used 
Gaussian distribution for the noise in robust statistics, e.g., [49]. For regularization 
methods in the RKHS framework, the key difficulty is that the hyperparameter esti- 
mation criteria and the regularized estimate may not have closed-form expressions. 
Several methods have been proposed to overcome this difficulty. In particular, an 
expectation maximization (EM) method was proposed in [10] and further improved 
in [55] exploiting a variational expectation method. 

Input design is an important issue for classical system identification and many 
results have been obtained, e.g., [38, 45, 47, 56]. For regularized system identifica- 
tion in RKHS, some results have been reported recently. The first result was given 
in [37] where the mutual information between the output and the impulse response 
was chosen as the input design criterion. Unfortunately, obtaining the optimal input 
involves the solution of a nonconvex optimization problem. Differently from [37], 
[65] adopts scalar measures of the Bayesian mean square error as input design crite- 
rion, proposing a two-step procedure to find the global optimal input through convex 
optimization. 

Concerning the construction of uncertainty regions around the dynamic system
estimates, approaches are available which return bounds that, beyond being
non-asymptotic, are also exact, i.e., with the desired inclusion probability. This requires
some assumptions on data generation, like the introduction of prior distributions on 
the impulse response. An important example, already widely discussed in this book, is 
the use of a Bayesian framework that interprets regularization as Gaussian regression 
[83]. The posterior density becomes available in closed form and Bayes intervals can 
be easily obtained. Another approach to compute bounds for linear regression is the 
sign-perturbed sums (SPS) technique [30]. Following a randomization principle, it 
builds guaranteed uncertainty regions for deterministic parametric models in a quasi- 
distribution free setup [11, 12]. Recently, there have been notable extensions to the 
class of models that SPS can handle. The first line of thought still sees the unknown 
parameters as deterministic but introduces regularization, see [29, 70, 89] and also 
[31] which is a first attempt to move beyond the strictly parametric nature of SPS. A 
second line of thought allows for the exploitation of some form of prior knowledge 
at a more fundamental probabilistic level [13, 70]. 

Finally, the interested readers are referred to the survey [73] for more references, 
see also [25, 58]. 
7.7 Appendix


7.7.1 Derivation of the First-Order Stable Spline Norm 


We will exploit a representation of the RKHS induced by the first-order discrete-
time stable spline kernel given by a linear transformation of the space $\ell_2$ containing
the squared summable sequences. This has some connections with the relationship
between squared summable function spaces and RKHS discussed in Remark 6.2,
even if no spectral decomposition of the kernel will be needed below.

Let $\mathcal{H}$ be the RKHS induced by the stable spline kernel (7.15) with elements
denoted by $g = \{g_t\}_{t=1}^{\infty}$. We will see that any $g \in \mathcal{H}$ can be written as

$$g_t = \sum_{j=1}^{\infty} \psi_{tj} w_j, \quad w \in \ell_2, \tag{7.95}$$

where the scalars $\{\psi_{tj}\}$ define the linear operator mapping $\ell_2$ into $\mathcal{H}$. By adopting
notation of ordinary algebra to handle infinite-dimensional objects, one can see g
as an infinite-dimensional column vector. In addition, (7.95) can be rewritten as
$g = \Psi w$, where $\Psi$ is an infinite-dimensional matrix with (t, j)-entry given by $\psi_{tj}$.
We will now obtain the expression of $\Psi$. Let

$$\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \lambda_3, \ldots\}, \quad \lambda_t = \alpha^t - \alpha^{t+1},$$

$$M = [v^1 \ v^2 \ v^3 \ \cdots], \quad v^t = \sum_{j=1}^{t} e^j,$$

where $e^j$ is the infinite-dimensional column vector with all null elements except its
jth entry which is equal to one. Let also

$$\Psi = M \Lambda^{1/2}. \tag{7.96}$$


The inverse $\Psi^{-1}$ of $\Psi$ acts as follows: given a sequence g, it maps g into

$$\Psi^{-1} g = \begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} (g_1 - g_2) \\ \frac{1}{\sqrt{\lambda_2}} (g_2 - g_3) \\ \vdots \end{bmatrix}. \tag{7.97}$$

Then, given $\Psi$ in (7.96), we will show that the space

$$\mathcal{H} = \big\{ \Psi w \ | \ w \in \ell_2 \big\}, \tag{7.98}$$

with inner product given by

$$\langle f, g \rangle_{\mathcal{H}} = \langle \Psi^{-1} f, \Psi^{-1} g \rangle_2, \tag{7.99}$$

is the RKHS induced by the stable spline kernel. First, it is easy to see that the null
space of $\Psi$ contains only the null vector. Then, since $\ell_2$ is Hilbert, one obtains that
$\mathcal{H}$ is a Hilbert space. We can now exploit Theorem 6.2, i.e., the Moore–Aronszajn
theorem, to prove that it is also the desired RKHS. To obtain this, the two conditions
described below have to be checked.

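The first condition says that any kernel section must belong to the space $\mathcal{H}$ in
(7.98). Thanks to the algebraic view, we can see the stable spline kernel K as an
infinite-dimensional matrix. Hence, the kernel sections are the infinite-dimensional
columns of K and, in particular, we use $K_t$ to indicate the tth column. Now, one has
to assess that $\|K_t\|_{\mathcal{H}} < \infty \ \forall t$. Note that

$$\Psi^{-1} K_t = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \sqrt{\alpha^t - \alpha^{t+1}} \\ \sqrt{\alpha^{t+1} - \alpha^{t+2}} \\ \vdots \end{bmatrix} \leftarrow t\text{th row}. \tag{7.100}$$

Then, we have

$$\langle K_t, K_t \rangle_{\mathcal{H}} = \langle \Psi^{-1} K_t, \Psi^{-1} K_t \rangle_2 = \sum_{j=t}^{+\infty} (\alpha^j - \alpha^{j+1}) = \alpha^t < \infty$$

and the first condition is thus satisfied.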
The second condition is the reproducing property, i.e., one has to assess that

$$\langle K_t, g \rangle_{\mathcal{H}} = g_t \quad \forall g \in \mathcal{H}, \ \forall t.$$

This holds true since

$$\langle K_t, g \rangle_{\mathcal{H}} = \langle \Psi^{-1} K_t, \Psi^{-1} g \rangle_2 = \sum_{j=t}^{+\infty} (g_j - g_{j+1}) = g_t,$$

showing that the second condition is also satisfied.
Using (7.99), one has

$$\|g\|_{\mathcal{H}}^2 = \langle \Psi^{-1} g, \Psi^{-1} g \rangle_2 = \sum_{t=1}^{\infty} \frac{(g_{t+1} - g_t)^2}{(1 - \alpha)\alpha^t}$$

and this confirms the norm's structure reported in (7.16).
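The factorization used in this derivation can be checked numerically on a finite truncation. The sketch below is an illustration with assumed values n = 30 and $\alpha = 0.8$. After truncation to n columns, the telescoping sum stops at $\alpha^{n+1}$, so $\Psi \Psi^T$ reproduces the kernel up to that constant offset, which vanishes as n grows.

```python
import numpy as np

n, alpha = 30, 0.8                         # assumed truncation size and decay
t = np.arange(1, n + 1)
lam = alpha ** t - alpha ** (t + 1)        # lambda_t = alpha^t - alpha^(t+1)
M = np.triu(np.ones((n, n)))               # column t is v^t = e^1 + ... + e^t
Psi = M * np.sqrt(lam)                     # Psi = M Lambda^(1/2): scales column t

K = alpha ** np.maximum.outer(t, t)        # truncated stable spline kernel
residual = K - Psi @ Psi.T                 # constant alpha^(n+1) in every entry

# Action of the inverse as in (7.97), with g_(n+1) = 0 in the truncated setting.
rng = np.random.default_rng(0)
g = Psi @ rng.standard_normal(n)
w = np.linalg.solve(Psi, g)
diffs = (g - np.append(g[1:], 0.0)) / np.sqrt(lam)

# Squared norm from the inner product (7.99) and from the closed-form sum.
norm_inner = w @ w
norm_sum = np.sum((np.append(g[1:], 0.0) - g) ** 2 / ((1.0 - alpha) * alpha ** t))
print(residual.max(), np.max(np.abs(w - diffs)), norm_inner, norm_sum)
```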


7.7.2 Proof of Proposition 7.1 


We will exploit the results on estimation of Gaussian vectors reported in Sect. 4.2.2.
Let $\mathrm{Cov}[u, v]$ denote the covariance matrix of two random vectors $u$ and $v$, i.e.,

$$\mathrm{Cov}[u, v] := \mathcal{E}\left[(u - \mathcal{E}[u])(v - \mathcal{E}[v])^T\right].$$

First, we consider the distribution of $Y$. Note that $L_i[g^0]$ is a linear functional of
the stochastic process $g^0$. Hence, since linear transformations of normal processes
preserve Gaussianity, the noise-free output $[L_1[g^0], \ldots, L_N[g^0]]^T$ is a multivariate
zero-mean Gaussian random vector. Furthermore, since

$$\mathrm{Cov}(L_i[g^0], L_j[g^0]) = \lambda L_i[L_j[K]],$$

the covariance matrix of $[L_1[g^0], \ldots, L_N[g^0]]^T$, apart from the scale factor $\lambda$, is
indeed defined by the output kernel matrix $O$ reported in (7.14) for the discrete-time
case, i.e., when $\mathcal{X} = \mathbb{N}$, and in (7.22) for the continuous-time case, i.e., when
$\mathcal{X} = \mathbb{R}^+$. Now, recall that the $e(t_i)$, where $i = 1, \ldots, N$, are assumed to be mutually
independent and Gaussian distributed with mean zero and variance $\sigma^2$. Moreover, they
are also assumed independent of $g^0$. One then obtains that $g^0$ and $Y$ are jointly
Gaussian, with the mean and covariance matrix of $Y$ given by

$$\mathcal{E}(Y) = 0, \qquad \mathrm{Cov}(Y, Y) = \lambda O + \sigma^2 I_N.$$

As regards the covariance matrix of $g^0$ and $Y$, the independence assumptions
imply that

$$\mathrm{Cov}(g^0(x), Y) = \lambda \left[L_1[K_x], \ldots, L_N[K_x]\right].$$

Then, using also the correspondence $\gamma = \sigma^2/\lambda$, we have

$$\begin{aligned}
\mathcal{E}[g^0(x) \mid Y] &= \lambda \left[L_1[K_x] \ \ldots \ L_N[K_x]\right] (\lambda O + \sigma^2 I_N)^{-1} Y \\
&= \left[L_1[K_x] \ \ldots \ L_N[K_x]\right] (O + \gamma I_N)^{-1} Y \\
&= \sum_{t=1}^{N} \hat{c}_t L_t[K_x],
\end{aligned}$$

where $\hat{c}_t$ is the $t$th entry of the vector $\hat{c}$ defined in (7.13) for the discrete-time case or
in (7.21) for the continuous-time case. This completes the proof.
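
The equivalence used in the last chain of equalities can be illustrated numerically in the discrete-time FIR setting. This is a hedged sketch, not the book's code: the TC kernel, the sizes `n` and `N`, the values of `alpha`, `lam`, `sigma2`, and the regression matrix `Phi` implementing the convolutions $L_i[\cdot]$ are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 30, 50                      # impulse response length, number of outputs (assumed)
alpha, lam, sigma2 = 0.8, 2.0, 0.5
gamma = sigma2 / lam               # correspondence gamma = sigma^2 / lambda

t = np.arange(1, n + 1)
K = alpha ** np.maximum.outer(t, t)            # discrete-time TC kernel matrix

u = rng.standard_normal(N + n)
# Row i holds the past inputs, so that Phi @ g is the convolution L_i[g]
Phi = np.array([u[i + n - 1 - np.arange(n)] for i in range(N)])

g0 = rng.multivariate_normal(np.zeros(n), lam * K)       # prior draw of g^0
Y = Phi @ g0 + np.sqrt(sigma2) * rng.standard_normal(N)  # noisy outputs

O = Phi @ K @ Phi.T                                      # output kernel matrix
c_hat = np.linalg.solve(O + gamma * np.eye(N), Y)
g_regularized = K @ Phi.T @ c_hat                        # sum_t c_t L_t[K_x]

# Conditional mean E[g^0 | Y] from the joint Gaussian distribution
g_posterior_mean = lam * K @ Phi.T @ np.linalg.solve(lam * O + sigma2 * np.eye(N), Y)

estimates_agree = np.allclose(g_regularized, g_posterior_mean)
```

The two estimates coincide because $(\lambda O + \sigma^2 I_N)^{-1} = \lambda^{-1}(O + \gamma I_N)^{-1}$, which is exactly the step that removes $\lambda$ in the proof.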


7.7.3 Proof of Theorem 7.5 


We only consider the proof for the discrete-time case (7.56). The continuous-time 
case (7.57) can be proved in a similar way. To prove (7.56), we first need a lemma. 


Lemma 7.1 Consider the linear operator $L_K$ defined by

$$L_K[l](\cdot) = \sum_{t=1}^{\infty} K(\cdot, t)\, l_t, \tag{7.101}$$

where $K : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ is a positive semidefinite kernel. Assume that $L_K$ satisfies
the following property: for any $l \in \ell_\infty$, one has $L_K[l] \in \ell_1$. Then, $L_K$ is a continuous
(bounded) linear operator, i.e., there exists a scalar $b > 0$, independent of $l$, such
that

$$\|L_K[l]\|_1 \le b \|l\|_\infty, \quad \forall l \in \ell_\infty. \tag{7.102}$$


Proof First, we show that for any $s \in \mathbb{N}$, the kernel section $K_s(\cdot)$ belongs to $\ell_1$. To
show this, for any $s \in \mathbb{N}$, we can define a sequence $l \in \ell_\infty$ in the following way:

$$l_t = \begin{cases} 1 & \text{if } K(s,t) > 0 \\ -1 & \text{otherwise.} \end{cases}$$

Then plugging this $l$ into (7.101) yields $L_K[l](s) = \sum_{t=1}^{\infty} |K(s,t)|$. Since $L_K[l] \in \ell_1$
for every $l \in \ell_\infty$, we obtain

$$\sum_{t=1}^{\infty} |K(s,t)| < \infty, \quad \forall s \in \mathbb{N}. \tag{7.103}$$

Now, for any $l, a \in \ell_\infty$, it holds that

$$|L_K[l](s) - L_K[a](s)| = \Big| \sum_{t=1}^{\infty} K(s,t)(l_t - a_t) \Big| \le \|l - a\|_\infty \sum_{t=1}^{\infty} |K(s,t)|, \tag{7.104}$$

where both $\|l - a\|_\infty$ and $\sum_{t=1}^{\infty} |K(s,t)|$ are finite for any $s \in \mathbb{N}$ since $l, a \in \ell_\infty$ and
in view of (7.103). Following (7.104), the remaining proof is a simple application of
the closed graph theorem, see Theorem 6.26. In fact, let $l \to a$ in $\ell_\infty$ and $L_K[l] \to g$
in $\ell_1$. Then (7.104) shows that $L_K[l](s) \to L_K[a](s)$ for every $s \in \mathbb{N}$, implying that
$g_s = L_K[a](s)$ for every $s \in \mathbb{N}$. As a result, the graph $(l, L_K[l])$ is closed and thus
$L_K$ is continuous (bounded) by the closed graph theorem.
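
The two elementary steps of this proof, the sign sequence that turns one row of the operator into a sum of absolute values and the bound (7.104), can be illustrated on a finite matrix. This is only a sketch: the kernel below (a TC kernel multiplied entrywise by a cosine kernel, so that entries change sign) and the chosen row index are assumptions made for the example.

```python
import numpy as np

t = np.arange(1, 9)
# Entrywise product of two positive semidefinite kernels: still positive
# semidefinite, and with entries of both signs (illustrative choice)
K = 0.6 ** np.maximum.outer(t, t) * np.cos(np.subtract.outer(t, t))

s = 2                                    # row under inspection (0-based index)
l = np.where(K[s] > 0, 1.0, -1.0)        # sign sequence built from the kernel section
row_abs_sum = np.abs(K[s]).sum()         # sum_t |K(s, t)|
applied = K[s] @ l                       # L_K[l](s)

# Bound (7.104) for this l and another bounded sequence a
a = np.random.default_rng(0).uniform(-1.0, 1.0, size=8)
lhs = abs(K[s] @ l - K[s] @ a)
rhs = np.max(np.abs(l - a)) * row_abs_sum
```

Here `applied` equals `row_abs_sum`, and `lhs` never exceeds `rhs`, mirroring (7.103) and (7.104) in finite dimension.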


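Now let us consider (7.56) in Theorem 7.5. We first prove the sufficient part, i.e.,

$$\sum_{s=1}^{\infty} \Big| \sum_{t=1}^{\infty} K(s,t)\, l_t \Big| < \infty, \ \forall l \in \ell_\infty \implies \mathcal{H} \subset \ell_1.$$

We start by introducing some definitions. For any $f \in \mathcal{H}$, we let $l \in \ell_\infty$ be a
sequence defined by the signs of $f$, i.e.,

$$l_t = \begin{cases} 1 & \text{if } f_t \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

and let also $l^n$ be a sequence defined by

$$l^n_t = \begin{cases} l_t & \text{for } t = 1, \ldots, n \\ 0 & \text{otherwise.} \end{cases}$$

Then we have

$$\sum_{t=1}^{n} |f_t| = \sum_{t=1}^{\infty} l^n_t f_t = \Big\langle f(\cdot), \sum_{t=1}^{\infty} l^n_t K_t(\cdot) \Big\rangle_{\mathcal{H}},$$

where the last identity is due to the reproducing property of $K$. Moreover, by the
Cauchy–Schwarz inequality, we have

$$\sum_{t=1}^{n} |f_t| \le \|f\|_{\mathcal{H}} \Big\| \sum_{t=1}^{\infty} l^n_t K_t(\cdot) \Big\|_{\mathcal{H}}. \tag{7.105}$$

Now we show that $\big\| \sum_{t=1}^{\infty} l^n_t K_t(\cdot) \big\|_{\mathcal{H}}$ is finite. First, we note that

$$\begin{aligned}
\Big\| \sum_{t=1}^{\infty} l^n_t K_t(\cdot) \Big\|_{\mathcal{H}}^2 &= \sum_{s=1}^{\infty} \sum_{t=1}^{\infty} l^n_s l^n_t K(s,t) \\
&= \sum_{s=1}^{\infty} l^n_s \Big( \sum_{t=1}^{\infty} l^n_t K(s,t) \Big) \\
&\le \sum_{s=1}^{\infty} \Big| \sum_{t=1}^{\infty} l^n_t K(s,t) \Big|,
\end{aligned}$$

and then from the linear operator $L_K$ defined in (7.101) and its boundedness property
(7.102) proved in Lemma 7.1, we obtain

$$\Big\| \sum_{t=1}^{\infty} l^n_t K_t(\cdot) \Big\|_{\mathcal{H}}^2 \le \|L_K[l^n]\|_1 \le b \|l^n\|_\infty = b,$$

where we have used the fact that $\|l^n\|_\infty = 1$ for any $n \in \mathbb{N}$. Noting the above equation
and (7.105) yields

$$\sum_{t=1}^{n} |f_t| \le \|f\|_{\mathcal{H}} \sqrt{b}, \quad \forall n \in \mathbb{N}.$$

Since $f \in \mathcal{H}$ and thus $\|f\|_{\mathcal{H}}$ is finite, $\sum_{t=1}^{n} |f_t|$ is bounded above for any $n \in \mathbb{N}$.
Further note that the partial sum $\sum_{t=1}^{n} |f_t|$ is an increasing sequence and bounded
above, therefore by the monotone convergence theorem, the limit of $\sum_{t=1}^{n} |f_t|$, i.e.,
$\lim_{n \to \infty} \sum_{t=1}^{n} |f_t|$, exists and is denoted by $\sum_{t=1}^{\infty} |f_t|$, which shows that $f \in \ell_1$.
Since $f$ was chosen arbitrarily, this implies $\mathcal{H} \subset \ell_1$ and thus completes the proof
for the sufficient part.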

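Now, we prove the necessary part, i.e.,

$$\mathcal{H} \subset \ell_1 \implies \sum_{s=1}^{\infty} \Big| \sum_{t=1}^{\infty} l_t K(s,t) \Big| < \infty, \ \forall l \in \ell_\infty.$$

Again, we start by introducing some definitions. For any $f \in \mathcal{H}$ and $l \in \ell_\infty$, we
define a new sequence $lf$ by letting $[lf]_t = l_t f_t$, $\forall t \in \mathbb{N}$, where $[lf]_t$ is the $t$th
entry in the sequence $lf$. Then we have $lf \in \ell_1$, because $l \in \ell_\infty$ and $f \in \ell_1$ due to
$\mathcal{H} \subset \ell_1$. Moreover, we define $g^n(\cdot) = \sum_{t=1}^{n} l_t K_t(\cdot)$ with $n \in \mathbb{N}$. Now we show that
the sequence of functions $g^n(\cdot)$ with $n \in \mathbb{N}$ is a weak Cauchy sequence in $\mathcal{H}$. To
show this, we take without loss of generality $m < n$ and $m \in \mathbb{N}$, and then we have

$$g^n(\cdot) - g^m(\cdot) = \sum_{t=m+1}^{n} l_t K_t(\cdot). \tag{7.106}$$

Moreover, we have

$$\langle g^n(\cdot) - g^m(\cdot), f(\cdot) \rangle_{\mathcal{H}} = \sum_{t=m+1}^{n} l_t \langle K_t(\cdot), f(\cdot) \rangle_{\mathcal{H}} = \sum_{t=m+1}^{n} l_t f_t, \quad \forall f \in \mathcal{H}.$$

Since $lf \in \ell_1$, i.e., $\sum_{t=1}^{\infty} |l_t f_t| < \infty$, the Cauchy criterion ensures that

$$\lim_{m,n \to \infty} \sum_{t=m+1}^{n} |l_t f_t| = 0, \tag{7.107}$$

which implies

$$\lim_{m,n \to \infty} \Big| \sum_{t=m+1}^{n} l_t f_t \Big| = 0.$$

Noting the above equation and (7.106) yields that the sequence of functions $g^n(\cdot) =
\sum_{t=1}^{n} l_t K_t(\cdot)$ with $n \in \mathbb{N}$ is a weak Cauchy sequence. Recall that every Hilbert space,
beyond being complete, is also weakly sequentially complete, which is because every
Hilbert space is reflexive, see Definition 2.5.23 along with Corollaries 2.8.10 and
2.8.11 in [62]. Hence, the sequence of functions $g^n(\cdot) = \sum_{t=1}^{n} l_t K_t(\cdot)$ with $n \in \mathbb{N}$ is
also a weakly convergent sequence, i.e., there exists an $h \in \mathcal{H}$ such that

$$\lim_{n \to \infty} \langle g^n(\cdot), f(\cdot) \rangle_{\mathcal{H}} = \langle h(\cdot), f(\cdot) \rangle_{\mathcal{H}}, \quad \forall f \in \mathcal{H}.$$

Now, we take $f(\cdot) = K_s(\cdot)$ in the above equation. Using the reproducing property
of $K$, the left-hand side becomes

$$\lim_{n \to \infty} \langle g^n(\cdot), K_s(\cdot) \rangle_{\mathcal{H}} = \sum_{t=1}^{\infty} l_t K(s,t),$$

while the right-hand side becomes

$$\langle h(\cdot), K_s(\cdot) \rangle_{\mathcal{H}} = h(s).$$

This implies that

$$\sum_{t=1}^{\infty} l_t K(s,t) = h(s), \quad \forall s \in \mathbb{N}.$$

Finally, note that $h \in \mathcal{H} \subset \ell_1$, therefore

$$\sum_{s=1}^{\infty} \Big| \sum_{t=1}^{\infty} l_t K(s,t) \Big| < \infty, \quad \forall l \in \ell_\infty,$$

which completes also the necessary part and, hence, concludes the proof.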


7.7.4 Proof of Theorem 7.7 


First, it is useful to set up some notation. Let $r$ be an integer or $r = \infty$. Then, we
define the set $\mathcal{U}_r$ as follows:

$$\mathcal{U}_r := \{ x \in \mathbb{R}^r : x(i) = \pm 1, \ \forall i = 1, \ldots, r \}. \tag{7.108}$$

Let $p$ be another integer associated with the odd number $m = 2p + 1$ and with
$n = 2^m$. We also use $x_i \in \mathcal{U}_m$, with $i = 1, 2, \ldots, n$, to indicate distinct vectors
containing exactly $m$ elements $\pm 1$ (their ordering is irrelevant). Then, for any $n =
2^3, 2^5, 2^7, \ldots$, the $n \times m$ matrix $V^{(n)}$ is given by

$$V^{(n)} = [x_1 \ x_2 \ \ldots \ x_n]^T \tag{7.109}$$

and its rows contain all the possible permutations of $\pm 1$. We now discuss the
inclusions stated in the theorem.

The inclusion $\mathcal{S}_1 \subset \mathcal{S}_s$ derives from Corollary 7.1 where we have seen that
absolute summability is a sufficient condition for kernel stability. The proof of the strict
inclusion $\mathcal{S}_1 \subset \mathcal{S}_s$ is not trivial and is reported in [7] where one can find a particular
kernel, function of the matrices $V^{(n)}$ in (7.109), that is stable but non-absolutely
summable.

For what concerns the inclusion $\mathcal{S}_s \subset \mathcal{S}_{ft}$, let $M_m$ denote a positive semidefinite
matrix of size $m \times m$. Consider also the linear operator $M_m : \mathbb{R}^m \to \mathbb{R}^m$ with domain
and co-domain equipped, respectively, with the $\ell_\infty$ and the $\ell_1$ norms. Its operator
norm is then given by

$$\|M_m\|_{\infty,1} = \max_{\|u\|_\infty = 1} \|M_m u\|_1 = \max_{x \in \mathcal{U}_m} \|M_m x\|_1, \tag{7.110}$$

where the last equality follows from the so-called Bauer's maximum principle for
convex functions. First, we prove that

$$\mathrm{trace}(M_m) \le \|M_m\|_{\infty,1} \le n\, \mathrm{trace}(M_m). \tag{7.111}$$

For this aim, since $V^{(n)T}$ contains all the vectors in $\mathcal{U}_m$ as columns, the problem is
equal to evaluating

$$M_m V^{(n)T}$$

and to find the column with maximum $\ell_1$ norm. The $\ell_1$ norm of each column can be
obtained as the scalar product of the column with a suitable $x \in \mathcal{U}_m$ containing the
signs of the column entries. Hence, the $n^2$ entries of

$$V^{(n)} M_m V^{(n)T}$$

surely contain these $n$ $\ell_1$ norms. Furthermore, the maximum $\ell_1$ norm which needs
to be found is the maximum of all these $n^2$ entries since $x_1^T c \le x_2^T c$, $\forall x_1 \in \mathcal{U}_m$ if
$x_2 = \mathrm{sign}(c)$, where the function sign returns, for each entry of $c$, value 1 if such entry
is larger than zero and $-1$ otherwise. Also, since $V^{(n)} M_m V^{(n)T}$ is positive semidefinite,
the maximum is found along its diagonal, i.e.,

$$\|M_m\|_{\infty,1} = \max_{i = 1, \ldots, n} \left[ V^{(n)} M_m V^{(n)T} \right]_{ii}.$$

We now note that the trace of $V^{(n)} M_m V^{(n)T}$ satisfies

$$\mathrm{trace}\left[ V^{(n)} M_m V^{(n)T} \right] \ge \|M_m\|_{\infty,1} \ge \frac{1}{n} \mathrm{trace}\left[ V^{(n)} M_m V^{(n)T} \right].$$

Finally,

$$\mathrm{trace}\left[ V^{(n)} M_m V^{(n)T} \right] = \mathrm{trace}\left[ M_m V^{(n)T} V^{(n)} \right] = \mathrm{trace}\left[ M_m (n I_m) \right] = n\, \mathrm{trace}[M_m]$$

and this proves (7.111).
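
For small $m$ the steps of this argument can be verified by brute force. The sketch below is only an illustration under stated assumptions: $m = 3$ (so $n = 8$), a randomly generated positive semidefinite matrix, and variable names that are not taken from the text.

```python
import itertools
import numpy as np

m = 3
n = 2 ** m
# Rows of V are all the sign vectors in U_m, so V plays the role of V^(n)
V = np.array(list(itertools.product([-1.0, 1.0], repeat=m)))

rng = np.random.default_rng(0)
B = rng.standard_normal((m, m))
M = B @ B.T                                     # random positive semidefinite matrix

# Operator norm from l_inf to l_1: maximum of ||M x||_1 over the sign vectors
op_norm = max(np.abs(M @ x).sum() for x in V)

G = V @ M @ V.T                                 # n x n matrix V M V^T
diag_max = G.diagonal().max()                   # maximum along the diagonal
trace_identity = np.isclose(np.trace(G), n * np.trace(M))
gram_identity = np.allclose(V.T @ V, n * np.eye(m))
bounds_hold = np.trace(M) <= op_norm + 1e-12 <= n * np.trace(M) + 1e-12
```

The check `op_norm == diag_max` (up to rounding) reflects the claim that the maximum entry of the positive semidefinite matrix $V^{(n)} M_m V^{(n)T}$ is attained on its diagonal.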
Now, think of $M_k$ as the $k \times k$ submatrix of the stable kernel represented by
the infinite-dimensional matrix $K$. We also use $L_K$ to denote the associated kernel
operator mapping $\ell_\infty$ into $\ell_1$. So, it holds that

$$\|M_k\|_{\infty,1} \le \|L_K\|_{\infty,1} < +\infty, \quad \forall k = 1, 2, \ldots,$$

where $\|L_K\|_{\infty,1}$ indicates the operator norm of $L_K$, i.e.,

$$\|L_K\|_{\infty,1} = \max_{x \in \mathcal{U}_\infty} \|Kx\|_1.$$

Using (7.111), we obtain

$$\mathrm{trace}[M_k] \le \|L_K\|_{\infty,1}, \quad \forall k = 1, 2, \ldots$$

and, since $\mathrm{trace}[M_k]$ is a monotone non-decreasing sequence upper-bounded by
$\|L_K\|_{\infty,1} < +\infty$, one also has

$$\sum_{i} K_{ii} \le \|L_K\|_{\infty,1} < +\infty.$$

This shows that the trace of any stable kernel is finite. Such inclusion is strict as the
following example shows. Let $v$ be a vector such that $v \in \ell_2$ and $v \notin \ell_1$. Consider the kernel

$$K = v v^T.$$

One has $\mathrm{trace}(K) = \|v\|_2^2 < +\infty$. If $w = \mathrm{sign}(v) \in \ell_\infty$ one has $Kw = v \|v\|_1$ and
this implies $\|Kw\|_1 = \infty$. So, the kernel $K$ has finite trace but is unstable.
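
A concrete instance of this counterexample is $v_k = 1/k$, which belongs to $\ell_2$ but not to $\ell_1$. The sketch below, whose choice of $v$ and truncation sizes are assumptions made only for illustration, shows that on growing truncations the trace stays bounded while $\|Kw\|_1$ keeps increasing.

```python
import numpy as np

def trace_and_l1(n):
    """Trace of K = v v^T and ||K w||_1 with w = sign(v), truncated to size n."""
    v = 1.0 / np.arange(1, n + 1)        # v_k = 1/k: square summable, not summable
    K = np.outer(v, v)
    w = np.sign(v)
    return float(np.trace(K)), float(np.abs(K @ w).sum())

sizes = [10, 100, 1000]
results = [trace_and_l1(n) for n in sizes]
traces = [r[0] for r in results]         # bounded by pi^2 / 6
l1_norms = [r[1] for r in results]       # equals (sum_k 1/k)^2, which diverges
```

On each truncation $\|Kw\|_1 = \|v\|_1^2$, the square of a harmonic partial sum, so it grows without bound, while the trace converges to $\pi^2/6$.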

The inclusion $\mathcal{S}_{ft} \subset \mathcal{S}_2$ relies on the important relation between nuclear and
Hilbert-Schmidt (HS) operators, e.g., see [35, 54, 84]. In particular, let $K$ be a
kernel, seen as an infinite-dimensional matrix, and let $L_K$ be the induced kernel
operator as a map from $\ell_2$ into $\ell_2$ itself. Given any orthonormal basis $\{v_i\}$ in $\ell_2$, the
nuclear norm of $L_K$ is

$$\sum_{i=1}^{\infty} \langle v_i, K v_i \rangle_2, \tag{7.112}$$

and is independent of the chosen basis. Then, $L_K$ is said to be nuclear if (7.112) is
finite. Its (squared) Hilbert-Schmidt (HS) norm is instead

$$\sum_{i=1}^{\infty} \|K v_i\|_2^2 \tag{7.113}$$

and is also independent of the chosen basis. Then, $L_K$ is said to be HS if (7.113)
is finite. It is also known that any nuclear operator is HS and can be written as the
composition of two HS operators.

For our purposes, we now exploit the fact that any finite-trace kernel induces a
nuclear operator, as shown in [8]. So, one also has that (7.113) is finite and, choosing
as $\{v_i\}$ the canonical basis $\{e_i\}$ of $\ell_2$, one obtains

$$\sum_{i=1}^{\infty} \|K e_i\|_2^2 = \sum_{ij} K_{ij}^2 < \infty. \tag{7.114}$$

Such inclusion is also strict as illustrated via the example

$$K = \mathrm{diag}\{1, 1/2, 1/3, \ldots, 1/k, \ldots\}.$$

Finally, $\mathcal{S}_2$ is contained in the set of all the positive semidefinite infinite matrices.
Furthermore, the inclusion is strict: this can be seen just considering the example
$K = v v^T$, where $v$ is the infinite-dimensional column vector with all components
equal to 1.


7.7.5 Proof of Theorem 7.9 


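The notation $L_K$ is still used to denote the operator induced by the kernel $K$ and
mapping $\ell_\infty$ into $\ell_1$. Its operator norm is $\|L_K\|_{\infty,1}$ while $(\zeta_i, \rho_i)$ are its eigenvalues
and eigenvectors orthogonal in $\ell_2$. From Theorem 7.5 and Lemma 7.1, one has

$$\mathcal{H} \subset \ell_1 \iff \|L_K\|_{\infty,1} < +\infty. \tag{7.115}$$

Since the function

$$f(u) = \|y\|_1 = \|Ku\|_1 = \sum_{i} \Big| \sum_{h} K_{ih}\, u(h) \Big|$$

is convex, the Bauer's maximum principle ensures that

$$\|L_K\|_{\infty,1} = \sup_{u \in \mathcal{U}_\infty} f(u) = \sup_{u \in \mathcal{U}_\infty} \sum_{i} \Big| \sum_{h} K_{ih}\, u(h) \Big|, \tag{7.116}$$

where

$$\mathcal{U}_\infty = \{ u \in \ell_\infty : |u(i)| = 1, \ \forall i \}.$$

Using notation of ordinary algebra to deal with infinite-dimensional matrices, we
can write $K = U D U^T$, where $D$ is diagonal and contains the eigenvalues $\zeta_i$ of $K$
while the columns of $U$ contain the corresponding eigenvectors $\rho_i$. One has

$$y = Ux, \qquad x = D U^T u$$

and, hence,

$$x = \left[ \zeta_1 \langle \rho_1, u \rangle_2 \ \ \zeta_2 \langle \rho_2, u \rangle_2 \ \ldots \right]^T,$$

$$y = \zeta_1 \langle \rho_1, u \rangle_2\, \rho_1 + \zeta_2 \langle \rho_2, u \rangle_2\, \rho_2 + \ldots.$$

Letting $s(u) = \mathrm{sign}(y)$, we obtain

$$h(u) := \|y\|_1 = \sum_{h} \zeta_h \langle \rho_h, u \rangle_2 \langle \rho_h, s(u) \rangle_2.$$

Using (7.116), also noticing that $f(u) = h(u)$, this implies

$$\|L_K\|_{\infty,1} = \sup_{u \in \mathcal{U}_\infty} \sum_{h} \zeta_h \langle \rho_h, u \rangle_2 \langle \rho_h, s(u) \rangle_2 = \sup_{u \in \mathcal{U}_\infty} h(u).$$

Now, define

$$g(u) = \sum_{h} \zeta_h \langle \rho_h, u \rangle_2^2, \qquad A = \sup_{u \in \mathcal{U}_\infty} \sum_{h} \zeta_h \langle \rho_h, u \rangle_2^2 = \sup_{u \in \mathcal{U}_\infty} g(u).$$

Exploiting the definition of $s(u)$, one has

$$h(u) \ge g(u) \implies \|L_K\|_{\infty,1} \ge A.$$

On the other hand,

$$\begin{aligned}
h(u) &= \sum_{h} \zeta_h \langle \rho_h, u \rangle_2 \langle \rho_h, s(u) \rangle_2 \\
&= \sum_{h} \left( \sqrt{\zeta_h} \langle \rho_h, u \rangle_2 \right) \left( \sqrt{\zeta_h} \langle \rho_h, s(u) \rangle_2 \right) \\
&\le \sqrt{\sum_{h} \zeta_h \langle \rho_h, u \rangle_2^2} \ \sqrt{\sum_{h} \zeta_h \langle \rho_h, s(u) \rangle_2^2} \\
&= \sqrt{g(u)\, g(s(u))}
\end{aligned}$$

that implies

$$\|L_K\|_{\infty,1} \le A.$$

So, one has

$$\|L_K\|_{\infty,1} = \sup_{u \in \mathcal{U}_\infty} \sum_{h} \zeta_h \langle \rho_h, u \rangle_2^2$$

and this concludes the proof in view of (7.115).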
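
The identity obtained at the end of this proof has a finite-dimensional counterpart that can be checked by enumerating all sign vectors. The following sketch assumes a small random positive semidefinite matrix in place of the kernel; the size `k` and all names are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
k = 4
B = rng.standard_normal((k, k))
K = B @ B.T                                   # positive semidefinite stand-in for the kernel

zeta, rho = np.linalg.eigh(K)                 # eigenvalues zeta_h, eigenvectors rho[:, h]
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))

# Left-hand side: operator norm computed as the maximum of ||K u||_1 over sign vectors
lhs = max(np.abs(K @ u).sum() for u in signs)
# Right-hand side: maximum of g(u) = sum_h zeta_h <rho_h, u>^2 over the same set
rhs = max(float(np.sum(zeta * (rho.T @ u) ** 2)) for u in signs)
```

Since $g(u) = u^T K u$, the comparison amounts to checking that the maximum of $\|Ku\|_1$ and the maximum of the quadratic form over sign vectors coincide.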
Chapter 8
Regularization for Nonlinear System Identification


Abstract In this chapter we review some basic ideas for nonlinear system identifica- 
tion. This is a complex area with a vast and rich literature. One reason for the richness 
is that very many parameterizations of the unknown system have been suggested, 
each with various proposed estimation methods. We will first describe in some
detail nonparametric techniques based on Reproducing Kernel Hilbert Space theory
and Gaussian regression. The focus will be on the use of regularized least squares,
first equipped with the Gaussian or polynomial kernel. Then, we will describe a new 
kernel able to account for some features of nonlinear dynamic systems, including 
fading memory concepts. Regularized Volterra models will be also discussed. We 
will then provide a brief overview on neural and deep networks, hybrid systems 
identification, block-oriented models like Wiener and Hammerstein, parametric and 
nonparametric variable selection methods. 


8.1 Nonlinear System Identification 


In Sect. 2.2, Eq. (2.2), a model of a dynamical system was defined as a predictor
function $g$ that maps past input-output data

$$Z^{t-1} = \{ y(t-1), u(t-1), y(t-2), u(t-2), \ldots \}$$

to the next output

$$\hat{y}(t|\theta) = g(t, \theta, Z^{t-1}), \tag{8.1}$$

where $\theta$ is a parameter vector that indexes the model. The predictor could possibly
also be a nonparametric map belonging to some function class. If $g$ is a nonlinear
function of $Z^{t-1}$ the model is nonlinear and the task to infer it from all the available
measurements contained in the training set $\mathcal{D}_T$ is the task of Nonlinear System
Identification. This is a complex area with a vast and rich literature. One reason for
the richness is that very many parameterizations of $g$ have been suggested, each
with various proposed estimation methods, e.g., see the survey [36]. The different
parameterizations allow various degrees of prior knowledge about the system to be
accounted for, which gives grey box models with different shades of grey: see the
section The Palette of Nonlinear Models in [36].

A typical element of nonlinear models is that somewhere in the structure there
can be a static nonlinearity present, $\zeta(t) = h(\eta(t))$. Dealing with static nonlinearities
is therefore an essential feature in nonlinear identification. See the sidebar "Static
Nonlinearities" in [36], and Sect. 8.5.2 for some brief remarks.

If no prior physical knowledge is available, we have a black-box model. Then we
need to employ parameterizations for $g$ that are very flexible and can describe any
reasonable function with arbitrary accuracy. Typical choices are neural networks
or deep nets; see Sect. 8.5.1 for some comments. Alternatively, one can define
$g$ nonparametrically as belonging to a certain (possibly infinite-dimensional) function
class. This leads to kernel methods, like regularization networks, and Gaussian
Process inference, treated in the next section.

Both in the case of grey and black-box models, nonlinear identification is char- 
acterized by considerable structural uncertainty. This leads typically to parametric 
models with many parameters and regularization will be a natural and useful tool to 
handle that. This chapter will discuss typical use of regularization for various tasks 
in nonlinear system identification. 


8.2 Kernel-Based Nonlinear System Identification 


Consider the measurements model

$$y(t_i) = f^0(x_i) + e(t_i), \quad i = 1, \ldots, N, \tag{8.2}$$

where $y(t_i)$ is the system output at instant $t_i$, corrupted by the noise $e(t_i)$, and $f^0$ is
the unknown function to reconstruct. The link with nonlinear system identification
is obtained by assuming that the $x_i$ contain past input and/or output values, i.e.,

$$x_i = [u_{t_i - 1} \ u_{t_i - 2} \ \ldots \ u_{t_i - m_u} \ \ y_{t_i - 1} \ y_{t_i - 2} \ \ldots \ y_{t_i - m_y}]. \tag{8.3}$$

In this way, the function $f^0$ represents a dynamic system. For the sake of simplicity,
let $m = m_u = m_y$, where $m$ will be called the system memory in what follows. Then,
if $m < \infty$ a nonlinear ARX (NARX) model is obtained. A nonlinear FIR (NFIR) is
instead obtained when $x_i$ contains only past inputs, i.e.,

$$x_i = [u_{t_i - 1} \ u_{t_i - 2} \ \ldots \ u_{t_i - m}]. \tag{8.4}$$

Now, with these correspondences, we can assume that our nonlinear predictor belongs
to a function class $\mathcal{H}$ given by a RKHS. Then, given the $N$ couples $\{x_i, y(t_i)\}$, the
regularization network

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{N} (y(t_i) - f(x_i))^2 + \gamma \|f\|_{\mathcal{H}}^2 \tag{8.5}$$

implements regularized NARX, with $f : \mathbb{R}^{2m} \to \mathbb{R}$, or NFIR, with $f : \mathbb{R}^{m} \to \mathbb{R}$.

To obtain the estimate $\hat{f}$ we can now exploit Theorem 6.15, i.e., the representer
theorem. Since we focus on quadratic loss functions, the results in Sect. 6.5.1 ensure
that our system estimate $\hat{f}$ not only exists and is unique but is also available in closed
form. In particular, let $Y = [y(t_1), \ldots, y(t_N)]^T$ and $K \in \mathbb{R}^{N \times N}$ be the kernel matrix
such that $K_{ij} = K(x_i, x_j)$. The nonlinear system estimate is then the sum of the $N$ kernel
sections centred on the $x_i$, i.e.,

$$\hat{f} = \sum_{i=1}^{N} \hat{c}_i K_{x_i} \tag{8.6}$$

with coefficients $\hat{c}_i$ contained in the vector

$$\hat{c} = (K + \gamma I_N)^{-1} Y, \tag{8.7}$$

with $I_N$ the $N \times N$ identity matrix.
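
The closed-form solution (8.6)-(8.7) translates into a few lines of code. The sketch below is a minimal illustration for the NFIR case with the Gaussian kernel; the toy system, the noise level, the values of `gamma` and `rho`, and all function names are assumptions introduced for the example and do not come from the text.

```python
import numpy as np

def nfir_regressors(u, m):
    """Rows x_i = [u_{t-1}, ..., u_{t-m}] for t = m, ..., len(u) - 1 (0-based)."""
    return np.array([u[t - 1 - np.arange(m)] for t in range(m, len(u))])

def gaussian_kernel(X, A, rho):
    """Matrix with entries exp(-||x - a||^2 / rho) for rows x of X and a of A."""
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / rho)

def fit_regularization_network(X, Y, gamma, rho):
    """Coefficient vector c_hat = (K + gamma I_N)^{-1} Y, as in (8.7)."""
    K = gaussian_kernel(X, X, rho)
    return np.linalg.solve(K + gamma * np.eye(len(Y)), Y)

def predict(X_new, X, c_hat, rho):
    """Evaluate f_hat = sum_i c_i K_{x_i} at the rows of X_new, as in (8.6)."""
    return gaussian_kernel(X_new, X, rho) @ c_hat

rng = np.random.default_rng(0)
u = rng.standard_normal(200)
m, gamma, rho = 2, 0.1, 10.0
X = nfir_regressors(u, m)
# Hypothetical NFIR system used only to generate data for the sketch
Y = X[:, 0] - 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(len(X))

c_hat = fit_regularization_network(X, Y, gamma, rho)
Y_fit = predict(X, X, c_hat, rho)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly; the cost is cubic in $N$, as for any direct implementation of (8.7).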

For future developments, in the remaining part of this section it is useful to cast 
the connection between regularization in RKHS and Bayesian estimation in this 
nonlinear setting. Some strategies for hyperparameters tuning will be also recalled. 


8.2.1 Connection with Bayesian Estimation of Gaussian 
Random Fields 


First, we recall an important result obtained in the linear setting in Sect. 7.1.4. The
starting point was the measurements model

$$y(t_i) = L_i[g^0] + e(t_i), \quad i = 1, \ldots, N,$$

with $g^0$ denoting the system impulse response and $L_i[g^0]$ representing the convolution
between $g^0$ and the input, evaluated at $t_i$. Proposition 7.1 said that, if $\mathcal{H}$ is the
RKHS induced by a kernel $K$, then

$$\hat{g} = \arg\min_{g \in \mathcal{H}} \sum_{i=1}^{N} (y(t_i) - L_i[g])^2 + \gamma \|g\|_{\mathcal{H}}^2$$

is the minimum variance impulse response estimator when the noise $e$ is white and
Gaussian while $g^0$ is a zero-mean Gaussian process (independent of $e$) of covariance
proportional to $K$, i.e.,

$$\mathcal{E}(g^0(t) g^0(s)) \propto K(t, s).$$

So, the choice of $K$ ensures that the probability is concentrated on our expected
impulse responses. For instance, in previous chapters we have seen that the TC/stable
spline class describes time-courses that are smooth and exponentially decaying with a
level established by some hyperparameters. A very simple approach to understand the
prior ideas introduced in the model is to simulate some curves that will thus represent
some of our candidate impulse responses. As an example, some realizations from
the discrete-time TC kernel (7.15), given by $K(i, j) = \alpha^{\max(i,j)}$ with $\alpha = 0.9$, are
reported in the left panels of Fig. 8.1.

Consider the nonlinear scenario with measurements model given by (8.2) and
input locations containing past inputs and outputs. The fundamental difference w.r.t.
the linear setting is that the unknown function $f^0$ now represents directly the nonlinear
input-output relationship. The connection with Bayesian estimation is obtained
thinking of $f^0$ as a nonlinear stochastic surface, in particular a zero-mean Gaussian
random field. This is a generalization of a stochastic process over general domains:
one has that, for any set of input locations $\{x_1^*, \ldots, x_p^*\}$, the vector $[f^0(x_1^*) \ \ldots \ f^0(x_p^*)]$
is jointly Gaussian. In particular, the covariance of such vector is assumed to be
proportional to the kernel matrix $K$ whose $(i, j)$-entry is $K_{ij} = K(x_i^*, x_j^*)$. This
corresponds to saying that $f^0$ is a zero-mean Gaussian random field with covariance $\lambda K$,
with $\lambda$ a positive scalar, independent of the white Gaussian noises $e(t_i)$ of variance
$\sigma^2$. Then,

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{N} (y(t_i) - f(x_i))^2 + \gamma \|f\|_{\mathcal{H}}^2, \qquad \gamma = \frac{\sigma^2}{\lambda},$$

turns out to be the minimum variance estimator of the nonlinear system $f^0$. In this
stochastic scenario, our model assumptions can be better understood by simulating
some nonlinear surfaces from the prior. They will represent some of our candidate
nonlinear systems. As an example, some realizations from the Gaussian kernel (6.43),
given by $K(x, a) = \exp(-\|x - a\|^2/\rho)$ with $\rho = 1000$, are reported in the right
panels of Fig. 8.1. It is apparent that such covariance includes just information on the
smoothness of the input-output map, i.e., the fact that similar inputs should produce
similar system outputs.
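
Realizations like those described above can be simulated by factorizing the covariance matrix. The sketch below is one possible way to do it, not the procedure used for the book's figure: the grid of input locations, the number of samples and the eigenvalue-based factor (chosen because Gaussian kernel matrices are numerically close to singular) are assumptions.

```python
import numpy as np

def psd_factor(K):
    """Return A with A A^T = K, clipping tiny negative eigenvalues due to rounding."""
    w, V = np.linalg.eigh(K)
    return V * np.sqrt(np.clip(w, 0.0, None))

def sample_zero_mean_gaussian(K, n_samples, rng):
    """Columns are independent draws from N(0, K)."""
    return psd_factor(K) @ rng.standard_normal((K.shape[0], n_samples))

rng = np.random.default_rng(0)

# Impulse response candidates: TC kernel 0.9^max(i, j) on t = 1..50
t = np.arange(1, 51)
K_tc = 0.9 ** np.maximum.outer(t, t)
impulse_responses = sample_zero_mean_gaussian(K_tc, 3, rng)        # shape (50, 3)

# NFIR surfaces with memory m = 2: Gaussian kernel with rho = 1000 on a 15 x 15 grid
grid = np.linspace(-20.0, 20.0, 15)
X = np.array([[a, b] for a in grid for b in grid])
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_gauss = np.exp(-d2 / 1000.0)
surfaces = sample_zero_mean_gaussian(K_gauss, 3, rng).reshape(15, 15, 3)
```

Each column of `impulse_responses` is a candidate impulse response, while each slice `surfaces[:, :, k]` is a candidate map from $(u_{t-1}, u_{t-2})$ to the noise-free output.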


8.2.2 Kernel Tuning 


As already discussed, e.g., in Sect. 3.5, even when the structure of a kernel is assigned,
the estimator (8.5) typically contains unknown parameters that have to be determined
from data. For example, if the Gaussian kernel $\exp(-\|x - a\|^2/\rho)$ is adopted the
unknown hyperparameter vector $\eta$ will contain the regularization parameter $\gamma$, the
kernel width $\rho$ and possibly also the system memory $m$. We now briefly discuss
estimation of $\eta$ just pointing out some natural connections with the techniques illustrated
in Sect. 7.2 in the linear scenario.

Fig. 8.1 Left panels: Realizations of a stochastic zero-mean Gaussian process modelling
discrete-time impulse response candidates. They are drawn by using the TC kernel $0.9^{\max(i,j)}$ as covariance.
Right panels: Realizations of a zero-mean Gaussian surface (random field) representing nonlinear
systems candidates, in particular NFIR models with memory $m = 2$ in (8.4). They are drawn by
using the Gaussian kernel $\exp(-\|x - y\|^2/1000)$ as covariance


An important observation is that, when a quadratic loss is adopted, even in the
nonlinear setting the estimator (8.5) leads to predictors linear in the data $Y$. In addition,
since we assume data generated according to (8.2), direct noisy measurements
of $f^0$ are available. Hence, the output kernel matrix $O$ used in Sect. 7.2 just reduces
to the kernel matrix $K$ computed over the $x_i$ where data are collected. In fact, from
(8.6) and (8.7) one can see that the predictions $\hat{y}_i$, i.e., the estimates of the $f^0(x_i)$,
are the components of $K\hat{c}$. So, they are collected in the vector

$$\hat{y}(\eta) = K(\eta)(K(\eta) + \gamma I_N)^{-1} Y. \tag{8.8}$$

Now, consider techniques like SURE and GCV that see $f^0$ as a deterministic function
so that the randomness in $Y$ derives only from the output noise. Exploiting the same
line of discussion reported in Sects. 3.5.2 and 3.5.3 (see also Sect. 7.2), from (8.8) we
see that the influence matrix is given by $K(\eta)(K(\eta) + \gamma I_N)^{-1}$. Hence, the degrees
of freedom are

$$\mathrm{dof}(\eta) = \mathrm{trace}(K(\eta)(K(\eta) + \gamma I_N)^{-1}). \tag{8.9}$$

Then, the SURE estimate of $\eta$ is obtained by minimizing the following unbiased
estimator of the prediction risk

$$\hat{\eta} = \arg\min_{\eta \in \Gamma} \frac{1}{N} \|Y - \hat{y}(\eta)\|^2 + 2\sigma^2 \frac{\mathrm{dof}(\eta)}{N}, \tag{8.10}$$

while the GCV estimate is

$$\hat{\eta} = \arg\min_{\eta \in \Gamma} \frac{\|Y - \hat{y}(\eta)\|^2}{(1 - \mathrm{dof}(\eta)/N)^2}, \tag{8.11}$$

where we have used $\Gamma$ to denote the optimization domain.

If we instead consider the Bayesian framework discussed in the previous subsection,
we see $f^0$ as a zero-mean Gaussian random field of covariance $\lambda K$, with
$\lambda$ a positive scale factor, independent of the white Gaussian noise of variance $\sigma^2$.
Since $y(t_i) = f^0(x_i) + e(t_i)$, following the same reasonings developed in the
finite-dimensional context in Sect. 4.4, one obtains that the vector $Y$ is zero-mean Gaussian,
i.e.,

$$Y \sim \mathcal{N}(0, Z(\eta))$$

with covariance matrix

$$Z(\eta) = \lambda K(\eta) + \sigma^2 I_N.$$

Above, the vector $\eta$ could, e.g., contain $\lambda$, $\sigma^2$, $m$ and also other parameters entering
$K$. Then, we easily obtain that its marginal likelihood estimate is

$$\hat{\eta} = \arg\min_{\eta \in \Gamma} Y^T Z(\eta)^{-1} Y + \log\det(Z(\eta)). \tag{8.12}$$
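
The tuning criteria discussed in this subsection can all be evaluated from the kernel matrix and the data. The sketch below is hedged: it uses the common form of SURE, (residual sum of squares)/N plus $2\sigma^2\,\mathrm{dof}/N$, and the GCV ratio without extra scaling, which are assumptions about the exact normalization; the function name, the test kernel and the data are illustrative as well.

```python
import numpy as np

def tuning_scores(K, Y, gamma, sigma2, lam):
    """SURE, GCV and marginal likelihood objectives for given hyperparameters."""
    N = len(Y)
    H = K @ np.linalg.inv(K + gamma * np.eye(N))     # influence matrix
    Y_hat = H @ Y                                     # predictions, as in (8.8)
    dof = float(np.trace(H))                          # degrees of freedom, as in (8.9)
    rss = float(np.sum((Y - Y_hat) ** 2))
    sure = rss / N + 2.0 * sigma2 * dof / N           # assumed SURE normalization
    gcv = rss / (1.0 - dof / N) ** 2                  # GCV criterion
    Z = lam * K + sigma2 * np.eye(N)                  # covariance of Y under the prior
    _, logdet = np.linalg.slogdet(Z)
    ml = float(Y @ np.linalg.solve(Z, Y) + logdet)    # marginal likelihood objective
    return {"dof": dof, "sure": sure, "gcv": gcv, "ml": ml, "Y_hat": Y_hat}

# Illustrative data: Gaussian kernel on random scalar inputs
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=40)
K = np.exp(-np.subtract.outer(x, x) ** 2 / 2.0)
Y = np.sin(x) + 0.2 * rng.standard_normal(40)

scores_small = tuning_scores(K, Y, gamma=0.01, sigma2=0.04, lam=4.0)
scores_large = tuning_scores(K, Y, gamma=10.0, sigma2=0.04, lam=0.004)
```

In practice these objectives would be minimized over a grid or with a numerical optimizer over $\eta$; note that increasing `gamma` lowers the degrees of freedom, as expected from (8.9).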
8.3 Kernels for Nonlinear System Identification


In the previous section we have cast the kernel-based estimator (8.5) in the framework
of nonlinear system identification. We have also provided its Bayesian interpretation
and recalled how to estimate the hyperparameter vector $\eta$ when the parametric form
of $K$ is assigned. But the crucial question is now the regularization design. This is
a fundamental issue, initially discussed in Sect. 3.4.2, which in this setting consists
of choosing a kernel structure suited to model nonlinear dynamic systems. Two
interesting options come from machine learning literature. The first one is the (already
mentioned) Gaussian kernel

$$K(x, a) = e^{-\frac{\|x - a\|^2}{\rho}}$$

that can describe input-output relationships just known to be smooth. We have also
seen in Sect. 6.6.5 that this model is infinite dimensional, i.e., its induced RKHS
cannot be spanned by a finite number of basis functions. It is also universal, being
dense in the space of all continuous functions defined on any compact subset of
the regressors' domain. These appear attractive features when little information on
system dynamics is available.

A second alternative is the polynomial kernel

$$K(x, a) = (\langle x, a \rangle_2 + 1)^p, \quad p \in \mathbb{N}, \tag{8.13}$$

where $\langle \cdot, \cdot \rangle_2$ is the classical Euclidean inner product. In the NFIR case, where the
input locations $x_i \in \mathbb{R}^m$ as given in (8.4), such kernel has a fundamental connection
with the Volterra representations of nonlinear systems, see, e.g., [35]. In fact, we know
from Sect. 6.6.4 that the induced RKHS is not universal but has dimension $\binom{m+p}{p}$ and
contains all possible monomials up to the $p$th degree. Hence, the polynomial kernel
implicitly encodes truncated discrete Volterra series of the desired order. It avoids
the curse of dimensionality since the possibly large number of coefficients does not have to be
computed explicitly thanks to the monomials' encoding. In fact, from (8.7) one can see
that estimation complexity, even if cubic in the number $N$ of output data, turns out to
be linear in the system memory $m$ and independent of the degree $p$ of nonlinearity.
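
The dimension $\binom{m+p}{p}$ of the space induced by the polynomial kernel can be checked numerically through the rank of a kernel matrix built on sufficiently many generic points. The sketch below is illustrative only: the values of `m` and `p`, the number of random points and the function names are assumptions.

```python
from math import comb

import numpy as np

def polynomial_kernel(X, A, p):
    """Matrix with entries (<x, a> + 1)^p for rows x of X and a of A."""
    return (X @ A.T + 1.0) ** p

def rkhs_dimension(m, p):
    """Number of monomials in m variables of degree at most p."""
    return comb(m + p, p)

rng = np.random.default_rng(0)
m, p = 2, 2
X = rng.standard_normal((25, m))             # 25 generic input locations in R^2
gram_rank = np.linalg.matrix_rank(polynomial_kernel(X, X, p))

dims = {(mm, pp): rkhs_dimension(mm, pp) for mm, pp in [(2, 2), (6, 3), (7, 3)]}
```

The kernel matrix is $25 \times 25$ but its rank cannot exceed the number of monomials, here $\binom{4}{2} = 6$, whereas evaluating the kernel only requires an inner product in $\mathbb{R}^m$ followed by a power.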


8.3.1 A Numerical Example 


We will consider a numerical example where the Gaussian and the polynomial kernel
are used to estimate a nonlinear dynamic system from input-output data.
Consider the NFIR

$$\begin{aligned}
f^0(x_t) = {} & \Big( \sum_{i=1}^{80} g_i^0 u_{t-i} \Big) - u_{t-2} u_{t-3} - 0.25 u_{t-4}^2 + 0.25 u_{t-1} u_{t-2} \\
& + 0.75 u_{t-3}^3 + 0.5 \left( u_{t-1}^2 + u_{t-1} u_{t-3} + u_{t-2} u_{t-4} \right)
\end{aligned} \tag{8.14}$$

with nonlinearities taken from [40] while the coefficients $g_i^0$ are reported in Fig. 8.2.
The inputs are independent Gaussian random variables of variance 4. The measurements
model is that reported in (8.2) with the noise $e$ white and Gaussian of variance
4 and independent of $u$. Such system is strongly nonlinear: the contribution of the
linear part (defined by the $g_i^0$) to the output variance is around 12% of the overall
variance.

Fig. 8.2 Coefficients $g_i^0$ defining the linear part of the system (8.14). They
represent the impulse response of a stable linear system obtained by randomly
generating a rational transfer function of order 10 (panel title: Linear part of the
system: impulse response; horizontal axis: Time)

We generate 2000 input-output couples and display them in Fig. 8.3. The first
1000 input-output couples $\{u_k, y_k\}_{k=1}^{1000}$ are the identification data while the other
1000 $\{u_k, y_k\}_{k=1001}^{2000}$ are the test set. They are used to assess the performance of an
estimator in terms of the prediction fit

$$100 \left( 1 - \left[ \frac{\sum_{k=1001}^{2000} |y_k - \hat{y}_k|^2}{\sum_{k=1001}^{2000} |y_k - \bar{y}|^2} \right]^{\frac{1}{2}} \right), \qquad \bar{y} = \frac{1}{1000} \sum_{k=1001}^{2000} y_k, \tag{8.15}$$

where the $\hat{y}_k$ are the predictions returned by a certain estimator by assuming null
initial conditions, i.e., computed by using only $\{u_k\}_{k=1001}^{2000}$, and setting to zero the
inputs falling outside the test set.
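
A prediction fit of this kind is straightforward to compute once the test outputs and the predictions are available. The sketch below assumes the usual percentage form, 100 times one minus the ratio between the norm of the prediction error and the norm of the centred outputs; the function name and the toy data are illustrative.

```python
import numpy as np

def prediction_fit(y, y_hat):
    """Percentage fit: 100 * (1 - ||y - y_hat|| / ||y - mean(y)||)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    num = np.sum((y - y_hat) ** 2)
    den = np.sum((y - y.mean()) ** 2)
    return 100.0 * (1.0 - np.sqrt(num / den))

rng = np.random.default_rng(0)
y_test = rng.standard_normal(1000)                    # stand-in for the test outputs

fit_perfect = prediction_fit(y_test, y_test)          # exact predictions
fit_mean = prediction_fit(y_test, np.full(1000, y_test.mean()))   # constant predictor
fit_noisy = prediction_fit(y_test, y_test + 0.3 * rng.standard_normal(1000))
```

With this convention a fit of 100 means perfect prediction, 0 corresponds to predicting the sample mean of the test outputs, and negative values are possible for poor predictors.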

First, consider the estimator (8.5) equipped with either the Gaussian or the polynomial
kernel with input locations

$$x_i = [u_{t_i - 1} \ u_{t_i - 2} \ \ldots \ u_{t_i - m}],$$

where the system memory $m$ is seen as a hyperparameter to be estimated from data.
Specifically, when using the Gaussian kernel

$$K(x, a) = e^{-\frac{\|x - a\|^2}{\rho}},$$

the estimator depends on the unknown hyperparameter vector

$$\eta = [m \ \gamma \ \rho],$$

where $m$ is the system memory, $\gamma$ is the regularization parameter and $\rho$ is the kernel
width. Instead, when using the polynomial kernel

$$K(x, a) = (\langle x, a \rangle_2 + 1)^p, \quad p \in \mathbb{N},$$

we have

$$\eta = [m \ \gamma \ p],$$

where, in place of $\rho$, the third unknown hyperparameter is the polynomial order $p$.
In both cases, we estimate $\eta$ by using an oracle. In particular, assigned a certain
$\eta$, the estimator (8.5) determines $\hat{f}$ by using only the identification data but the
oracle has access to the test set to select that hyperparameter vector that maximizes
the prediction fit (8.15). Note that calibration is computationally quite expensive. In
fact, one has to introduce a grid to account for the discrete nature of the system
memory $m$. The polynomial kernel also requires the introduction of another grid for
the polynomial order $p$.

Fig. 8.3 Input and output data generated by the nonlinear system (8.14). The first 1000 couples
(black line) are used as identification data while the other 1000 (red) are the test set used to assess
the prediction performance of a model

Fig. 8.4 Test set data (red line), extracted from the last 1000 outputs visible in the
right panel of Fig. 8.3, and predictions returned by (8.5) equipped with the Gaussian
kernel (top panel, black) and the polynomial kernel (bottom panel, black). The
estimators use the first 1000 input-output couples in Fig. 8.3 as training data, with
hyperparameter vector $\eta$ tuned by an oracle that maximizes the test set fit

Figure 8.4 reports some test set data (red line) extracted from the last 1000 outputs
displayed in the right panel of Fig. 8.3. When adopting the Gaussian kernel, the oracle
chooses $m = 4$. When using the polynomial kernel it selects $m = 6$ and sets the
polynomial order to $p = 3$. The top panel of Fig. 8.4 shows the predictions returned by
the oracle-based Gaussian kernel (black line). The prediction fit is not so large, equal
to 69.6%. The bottom panel instead plots results from the oracle-based polynomial
kernel (black line). The prediction capability increases to 73.5% but does not appear
so satisfactory. Figure 8.5 also reports the MATLAB boxplots of 100 prediction fits
returned by the two kernel-based estimators after a Monte Carlo study. At any of the
100 runs new realizations of inputs and noises define a new identification and test
set. One can see that, on average, the polynomial kernel performs a bit better than
the Gaussian kernel, but its mean prediction fit is around 72%.

Fig. 8.5 Boxplots of 100 prediction fits (8.15) obtained after a Monte Carlo
study by the oracle-based estimators equipped with the Gaussian and polynomial
kernel


8.3.2 Limitations of the Gaussian and Polynomial Kernel 


From (8.14) one can see that the NFIR order is m = 80 while the oracle sets m = 4 
and m = 6 when using, respectively, the Gaussian and the polynomial kernel. This 
introduces a bias in the estimation process that is clearly visible in the predictions 
reported in Fig. 8.4. Let us try to understand the reasons for this phenomenon.


Polynomial kernel First, consider the polynomial kernel. The oracle chooses the correct
polynomial order p = 3 to account for the highest-order (cubic) term, with coefficient
0.75, present in the system. Such a choice, however, already defines a complex model
since it includes all the monomials up to order 3. In particular, with m = 6 and p = 3
the number of adopted basis functions is

$$\binom{m+p}{p} = \binom{6+3}{3} = 84,$$

which is quite large considering that 1000 outputs are available. If m is increased to 7,
one would implicitly use

$$\binom{m+p}{p} = \binom{7+3}{3} = 120$$
basis functions. In general, values of m larger than 6 are not acceptable for the oracle:
even a careful tuning of the regularization parameter γ does not allow good control of
the estimator's variance. This is illustrated in Fig. 8.6, which displays the best test
set prediction fit that can be obtained by the oracle as a function of the system
memory m. The maximum is indeed obtained with m = 6. Instead, the value m = 80 leads
to a very small fit, around 25%, because this introduces an overly complex model with

$$\binom{m+p}{p} = \binom{80+3}{3} = 91881$$

monomials.

Fig. 8.6 Prediction fits by the oracle-based estimator equipped with the polynomial
kernel as a function of system memory m (horizontal axis: system memory m, from 0 to
80). The optimal model dimension is achieved for m = 6
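The monomial counts quoted in this discussion are easy to reproduce. The short sketch below is only an illustration; the helper name `n_monomials` is ours and does not appear in the text.

```python
from math import comb

def n_monomials(m: int, p: int) -> int:
    """Number of monomials of degree <= p in m variables (constant term included)."""
    return comb(m + p, p)

# Memory values discussed in the text, with polynomial order p = 3.
for m in (6, 7, 80):
    print(f"m = {m:2d}, p = 3 -> {n_monomials(m, 3)} basis functions")
```

With 1000 training outputs, the count for m = 80 exceeds the number of data points by almost two orders of magnitude.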


Another reason that prevents the polynomial kernel from controlling model variance well
is the way it regularizes (implicitly) the monomial coefficients. We describe this point
through a simple example. A quadratic polynomial kernel is considered, but similar
considerations hold for larger degrees. Let p = 2, $x = [u_{t-1} \dots u_{t-m}]$ and
$a = [u_{\tau-1} \dots u_{\tau-m}]$. Exploiting the multinomial theorem one obtains

$$
\begin{aligned}
K(x,a) &= \Big(\sum_{i=1}^{m} u_{t-i}u_{\tau-i} + 1\Big)^2 \\
&= \sum_{i=1}^{m} u_{t-i}^2 u_{\tau-i}^2 + 2\sum_{i=2}^{m}\sum_{j=1}^{i-1} (u_{t-i}u_{t-j})(u_{\tau-i}u_{\tau-j}) + 2\sum_{i=1}^{m} u_{t-i}u_{\tau-i} + 1.
\end{aligned}
$$

This defines the following diagonalized version of the quadratic polynomial kernel

$$K(x,a) = \sum_i \zeta_i \rho_i(x)\rho_i(a),$$

where the $\rho_i(x)$ are all the monomials up to degree 2 in the components of x, i.e.,
the constant 1, the linear terms $u_{t-i}$, the squares $u_{t-i}^2$ and the cross-products
$u_{t-i}u_{t-j}$ with $i \neq j$.

Fig. 8.7 Some realizations from a zero-mean stochastic process with covariance given by
a Gaussian kernel (left panel) and by a Gaussian plus linear kernel (right)

According to the RKHS theory described in Sect. 6.3, for any f in the RKHS $\mathcal{H}$
induced by such kernel one has

$$f(x) = \sum_i c_i \rho_i(x), \qquad \|f\|_{\mathcal{H}}^2 = \sum_i \frac{c_i^2}{\zeta_i},$$

where all the eigenvalues $\zeta_i$ assume value 1 or 2 (most of them are equal to 2).
Hence, one can see that the regularizer $\|f\|_{\mathcal{H}}^2$ does not incorporate any
fading memory concept typical of dynamic systems. In fact, the coefficients of the two
monomials $\{u_{t-m}^2, u_{t-1}^2\}$, or those of the couple
$\{u_{t-m}u_{t-m+1}, u_{t-2}u_{t-1}\}$, are assigned the same penalty. But, similarly to
the linear case, one should instead expect that inputs $u_{t-i}$ have less influence on
$y_t$ as the positive lag i increases.


Gaussian kernel As in the case of the polynomial model, one of the limitations of the
Gaussian kernel $K(x,a) = \exp(-\|x-a\|^2/\rho)$ in modelling nonlinear systems is that
it does not include any fading memory concept. Hence, the inputs
$\{u_{t-1}, u_{t-2}, \dots, u_{t-m}\}$ included in the input location are expected to
have the same influence on $y_t$. This can be appreciated also through the Bayesian
interpretation of regularization, e.g., by inspecting the system realizations generated
by the Gaussian kernel reported in the right panels of Fig. 8.1.

Fig. 8.8 True function (red line), noisy data and regularized estimate returned by (8.5)
by using a Gaussian kernel $K(u,a) = \exp(-(u-a)^2/500)$ (left panel, "Gaussian kernel
estimate", black) and a Gaussian plus linear kernel $K(u,a) = \exp(-(u-a)^2/500) + 10ua$
(right panel, "Gaussian kernel+linear trend", black). The regularization parameter γ is
estimated from data via marginal likelihood optimization (8.12)


Still adopting a stochastic viewpoint, another drawback is that the covariance
$\exp(-\|x-a\|^2/\rho)$ describes stationary processes, which implies that the variance
of $f^0(x)$ does not depend on the input location. This is now illustrated in the
one-dimensional case where $x \in \mathbb{R}$ and the kernel models a static nonlinear
system $f^0(u)$, i.e., the (noiseless) output y depends only on a single input value u.
The left panel of Fig. 8.7 plots some realizations from $\exp(-(u-a)^2/500)$. They can
be poor nonlinear system candidates since a nonlinear system, like that reported in
(8.14), often contains also a linear component. For this reason it can be useful to
enrich the model with a linear kernel. Its effect can be appreciated by looking at
the realizations plotted in the right panel of Fig. 8.7, which are now drawn by using
$\exp(-(u-a)^2/500) + ua/400$ as covariance.

The fact that the predictive capability of a nonlinear model can improve considerably
by adding a linear component can be understood also by considering Theorem 6.15
(representer theorem). Using only a Gaussian kernel, the estimate $\hat{f}$ of the
nonlinear system returned by (8.5) is the sum of N Gaussian functions centred on the
$x_i$. Hence, in the regions where no data are available, the function $\hat{f}$ just
decays to zero and this can lead to poor predictions when, e.g., a linear component is
present in the system. This phenomenon is illustrated in the left panel of Fig. 8.8. In
this case, the prediction performance can be greatly enhanced by adding a linear kernel,
whose results are visible in the right panel of the same figure.
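The decay-to-zero effect can be reproduced with a few lines of NumPy. The following is a minimal sketch, not the experiment behind Fig. 8.8: the test system `f_true`, the noise level, the value of γ and the weight `scale` of the linear kernel are all illustrative assumptions, and the estimator is written in the usual regularized form $c = (K + \gamma I)^{-1} y$.

```python
import numpy as np

def gauss(u, a, rho=500.0):
    """Gaussian kernel matrix exp(-(u - a)^2 / rho) for scalar inputs."""
    return np.exp(-(u[:, None] - a[None, :]) ** 2 / rho)

def gauss_plus_linear(u, a, rho=500.0, scale=10.0):
    """Gaussian kernel plus a linear kernel scale * u * a (scale is illustrative)."""
    return gauss(u, a, rho) + scale * u[:, None] * a[None, :]

def fit_predict(kernel, u_train, y_train, u_test, gamma):
    """Regularized kernel estimate: c = (K + gamma I)^{-1} y, f(u) = sum_i c_i K(u, u_i)."""
    K = kernel(u_train, u_train)
    c = np.linalg.solve(K + gamma * np.eye(len(u_train)), y_train)
    return kernel(u_test, u_train) @ c

rng = np.random.default_rng(0)
u_train = rng.uniform(-50.0, 50.0, 60)          # training inputs only in [-50, 50]
f_true = lambda u: 0.5 * u + 5.0 * np.exp(-u ** 2 / 200.0)   # hypothetical system
y_train = f_true(u_train) + rng.normal(0.0, 1.0, 60)

u_far = np.array([200.0])                       # far away from every training input
pred_g = fit_predict(gauss, u_train, y_train, u_far, gamma=1.0)
pred_gl = fit_predict(gauss_plus_linear, u_train, y_train, u_far, gamma=1.0)
print("true:", f_true(u_far), "Gaussian:", pred_g, "Gaussian+linear:", pred_gl)
```

Far from the data the Gaussian-only estimate is essentially zero, while the estimate with the added linear kernel continues the linear trend learnt from the training region.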


8.3.3 Nonlinear Stable Spline Kernel 


We will build a kernel $\mathcal{K}$ for nonlinear system identification, namely the
nonlinear stable spline kernel, by exploiting what has been learnt from the previous
example. To simplify exposition, we consider the NFIR case but all the ideas here
developed can be immediately extended to NARX models, as discussed at the end of this
section. First, it is useful to define $\mathcal{K}$ as the sum of a linear and a
nonlinear kernel, i.e.,

$$\mathcal{K}(x_i,x_j) = \lambda_L\, x_i^T P x_j + \lambda_{NL}\, K(x_i,x_j), \tag{8.16}$$

where the input locations are here seen as column vectors, i.e.,

$$x_i = [u_{t_i-1}\ u_{t_i-2}\ \dots\ u_{t_i-m}]^T,$$

$P \in \mathbb{R}^{m\times m}$ is a symmetric positive semidefinite matrix that models
the impulse response of the system's linear part while K describes the nonlinear
dynamics. Note that the two scale factors $\lambda_L$ and $\lambda_{NL}$ are unknown
hyperparameters that balance the contributions of the linear and nonlinear part to the
output.

As for P, this matrix can be defined by resorting to the class of stable
kernels developed in the previous chapters. In particular, using the TC/stable spline
kernel, the (a, b)-entry of P is

$$P_{ab} = \alpha_L^{\max(a,b)}, \quad 0 \le \alpha_L < 1, \quad a = 1,\dots,m, \quad b = 1,\dots,m, \tag{8.17}$$

where $\alpha_L$ determines the decay rate of the impulse response governing the linear
dynamics.

As for K, we will define it by modifying the classical Gaussian kernel
in order to include fading memory concepts. Following the same ideas underlying
the TC kernel, we include the information that $u_{t-i}$ is expected to have less
influence on $y_t$ as i increases by defining

$$K(x_i,x_j) = \exp\Big(-\sum_{k=1}^{m} \alpha_{NL}^{k-1}\,\frac{(u_{t_i-k}-u_{t_j-k})^2}{\rho}\Big), \quad 0 < \alpha_{NL} \le 1. \tag{8.18}$$


The additional hyperparameter $\alpha_{NL}$ encodes the information that the influence
of past inputs decays exponentially to zero. To understand how this kernel models the
nonlinear surface, and how different values of $\alpha_{NL}$ can describe different
system features, we can use the Bayesian interpretation of regularization. In
particular, consider an example with m = 2, so that the components of $x_i$ are
$u_{t_i-1}$ and $u_{t_i-2}$, and let the system $f^0$ be a zero-mean Gaussian random
field with covariance given by (8.18) with $\rho = 1000$. If $\alpha_{NL} = 1$ we recover
the Gaussian kernel. Hence, before seeing any data, $u_{t_i-1}$ and $u_{t_i-2}$ are
expected to have the same influence on the system output. This can be appreciated by
drawing some realizations from such a random field, e.g., see the top panel of Fig. 8.9
(or the right panels of Fig. 8.1).

With $\alpha_{NL}$ very close to zero, the output depends mainly on $u_{t_i-1}$, i.e.,

$$K(x_i,x_j) \approx \exp\Big(-\frac{(u_{t_i-1}-u_{t_j-1})^2}{\rho}\Big).$$

This can be appreciated by looking at the realization in the middle panel of Fig. 8.9
obtained with $\alpha_{NL} = 0.001$. One can see that, for fixed $u_{t_i-1}$, changes in
$u_{t_i-2}$ do not produce appreciable variations in the function value. If the value of
$\alpha_{NL}$ is now increased, the input value $u_{t_i-2}$ starts playing a role. This is
visible in the bottom panel where the realization is now generated by using
$\alpha_{NL} = 0.1$.
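To make the construction (8.16)-(8.18) concrete, the following NumPy sketch assembles the kernel matrix for a batch of NFIR regressors. It is only an illustration under our own conventions: the function names (`tc_matrix`, `nss_kernel`, `regressors`) and the row-wise layout of the input locations are assumptions, not part of the text.

```python
import numpy as np

def tc_matrix(m, alpha_l):
    """TC kernel matrix of (8.17): P[a, b] = alpha_l ** max(a, b), a, b = 1..m."""
    idx = np.arange(1, m + 1)
    return alpha_l ** np.maximum.outer(idx, idx)

def nss_kernel(X, Z, lam_l, lam_nl, alpha_l, alpha_nl, rho):
    """Nonlinear stable spline kernel (8.16) between the rows of X and Z.

    Each row is an input location [u_{t-1}, ..., u_{t-m}].
    """
    m = X.shape[1]
    P = tc_matrix(m, alpha_l)
    w = alpha_nl ** np.arange(m)                 # weights alpha_nl^(k-1), k = 1..m
    diff = X[:, None, :] - Z[None, :, :]
    d2 = (diff ** 2 * w).sum(axis=2)             # weighted squared distance in (8.18)
    return lam_l * X @ P @ Z.T + lam_nl * np.exp(-d2 / rho)

def regressors(u, m):
    """Stack the input locations [u[t-1], ..., u[t-m]] for t = m, ..., len(u)-1."""
    return np.stack([u[t - m:t][::-1] for t in range(m, len(u))])
```

Setting `alpha_nl = 1` and `lam_l = 0` gives back the plain Gaussian kernel, while a small `alpha_nl` makes the kernel almost insensitive to all but the most recent input, as described above.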

The nonlinear stable spline kernel also enjoys a computational advantage. Using
classical machine learning kernels, like the Gaussian or the polynomial one,
the choice of the dimension m of the input space is a delicate issue. It requires discrete
tuning, as encountered in classical linear system identification to estimate, e.g., the
FIR or ARX order, and this can be computationally expensive. In the case of the
polynomial kernel, another discrete parameter is the polynomial order p, which requires
an additional grid. By introducing stability/fading memory hyperparameters, one
can instead set m to a large value, increasing the flexibility of the estimator. Then,
estimation of $\alpha_L$ and $\alpha_{NL}$ from data makes it possible to control the
"effective" dimension of the regressor space in a continuous manner. In light of the
continuous nature of the optimization domain, one needs to solve only one optimization
problem, involving, e.g., SURE (8.10), GCV (8.11) or Empirical Bayes (8.12).

Finally, as already mentioned, the extension to NARX models is very simple. Let
$x_i = [a_i^T\ b_i^T]^T$ with

$$a_i = [u_{t_i-1}\ u_{t_i-2}\ \dots\ u_{t_i-m_u}]^T, \qquad b_i = [y_{t_i-1}\ y_{t_i-2}\ \dots\ y_{t_i-m_y}]^T.$$

Then, the kernel (8.16) can be modified as follows

$$\mathcal{K}(x_i,x_j) = \lambda_a\, a_i^T P_a a_j + \lambda_b\, b_i^T P_b b_j + \lambda_c\, K_c(a_i,a_j) K_d(b_i,b_j) \tag{8.19}$$

with the matrices $P_a$ and $P_b$ defined by the TC kernel (8.17), with possibly different
decay rates $\alpha_L$, and the nonlinear kernels $K_c$ and $K_d$ defined by (8.18), with
possibly different decay rates $\alpha_{NL}$. A possible variation is

$$\mathcal{K}(x_i,x_j) = \lambda_a\, a_i^T P_a a_j + \lambda_b\, b_i^T P_b b_j + \lambda_c\, K_c(a_i,a_j) + \lambda_d\, K_d(b_i,b_j), \tag{8.20}$$

where the nonlinear dynamics are no longer a product, as in (8.19), but instead a sum of
nonlinear functions which depend on either past inputs or past outputs. In fact, recall
from Theorem 6.6 that sums and products of kernels induce well-defined RKHSs
containing, respectively, sums and products of functions belonging to the spaces
associated with the single kernels.
Fig. 8.9 Realizations ("Nonlinear system realizations") from a zero-mean Gaussian random
field having covariance
$\exp\big(-((u_{t_i-1}-u_{t_j-1})^2 + \alpha_{NL}(u_{t_i-2}-u_{t_j-2})^2)/1000\big)$
for three different values of $\alpha_{NL}$

Fig. 8.10 Test set data (red line), extracted from the last 1000 outputs visible in the
right panel of Fig. 8.3, and predictions (black) returned by (8.5) equipped with the
nonlinear stable spline kernel [...] couples in Fig. 8.3 as training [...] optimization


8.3.4 Numerical Example Revisited: Use of the Nonlinear Stable Spline Kernel


Let us now reconsider the numerical example where the nonlinear system (8.14) is
used to generate the identification and test data reported in Fig. 8.3. Now, we use
the estimator (8.5) equipped with the nonlinear stable spline kernel (8.16). System
memory is set to m = 100. Hence, we let $\alpha_L$ and $\alpha_{NL}$ determine from data
which past inputs mostly influence the output due to the linear and nonlinear system
part, respectively. In particular, the hyperparameter vector
$\eta = [\lambda_L\ \lambda_{NL}\ \alpha_L\ \alpha_{NL}\ \rho]$ is estimated via marginal
likelihood maximization using the 1000 input-output training data.

Figure 8.10 shows the same test set data (red line) reported in Fig. 8.4 and extracted 
from the last 1000 outputs visible in the right panel of Fig. 8.3. The predictions (black 
line) returned by the nonlinear stable spline kernel are now very close to truth. The 
prediction fit is around 90%. Comparing these results with those in Fig. 8.4, one can 
see that the prediction performance is much better than that of the Gaussian and 
polynomial kernel. Recall also that these two estimators tune complexity by using 
an oracle that is not implementable in practice. Figure 8.11 also plots the MATLAB 
boxplots of 100 prediction fits returned after a Monte Carlo study of 100 runs by 
these two oracle-based estimators, already present in Fig. 8.5, and by nonlinear stable 
spline. One can see that the use of a regularizer that accounts for dynamic systems 
features largely improves the prediction fits. 
Fig. 8.11 Boxplots of 100 prediction fits. The first two on the left are obtained by
the oracle-based estimators equipped with the Gaussian and polynomial kernel. The
boxplot on the right is obtained by the nonlinear stable spline kernel with
hyperparameters estimated by marginal likelihood maximization (which exploits only the
[...])


8.4 Explicit Regularization of Volterra Models


In what follows, we use C(k, m) to indicate the number of ways one can form
the nonnegative integer k as the sum of m nonnegative integers. This is the same
problem as distributing k objects to m groups (some groups may get zero objects).
By combinatorial theory we have

$$C(k,m) = \binom{k+m-1}{m-1}. \tag{8.21}$$

We adopt the model description (8.2) and seek a simple representation for the model
f(x). For simple notation, assume that f is scalar valued with past inputs only, i.e.,
$(m_y = 0,\ m_u = m)$, with input location x given by (8.4). A straightforward idea is to
mimic the polynomial Taylor expansion

$$f(x) = \sum_{k=1}^{p} g_k x^k. \tag{8.22}$$

This innocent-looking function expansion is in fact a bit more complex than it looks.
The kth power of the m-row vector x is to be interpreted as a C(k, m)-dimensional column
vector with each element being a monomial of the m components x(i) of x with sum
of exponents being k:

$$x^k(r) = x(1)^{\beta_r(k,1)}\, x(2)^{\beta_r(k,2)} \cdots x(m)^{\beta_r(k,m)} \tag{8.23a}$$

$$\beta_r(k,\ell) \text{ nonnegative, such that } \sum_{\ell=1}^{m} \beta_r(k,\ell) = k \tag{8.23b}$$

$$r = 1, 2, \dots, C(k,m). \tag{8.23c}$$

In (8.22) $g_k$ is to be interpreted as a row vector with C(k, m) elements

$$g_k = [g_k^{(1)}\ \dots\ g_k^{(C(k,m))}]. \tag{8.23d}$$

The response f(x) is thus made of $d(p,m) = \sum_{k=1}^{p} C(k,m)$ contributions
("impulse responses") from each of the nonlinear combinations of past inputs

$$x^k(r) = u_{t-1}^{\beta_r(k,1)} \cdots u_{t-m}^{\beta_r(k,m)} \tag{8.23e}$$

$$r = 1, \dots, C(k,m), \quad k = 1, \dots, p. \tag{8.23f}$$

This expansion of the model (8.22) is the Volterra Model discussed, e.g., by [7, 35]. 
It has d(p, m) parameters. The reader may recognize this as an explicit treatment 
of the polynomial kernel (8.13) which does not exploit any basis functions implicit 
encoding and, hence, does not exploit the kernel trick described in Remark 6.3. This 
has also some connections with the explicit regularization approaches for linear 
system identification discussed in Sect. 7.4.4 using, e.g., Laguerre functions. 
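The bookkeeping behind (8.21)-(8.23) can be checked by brute-force enumeration. The sketch below lists all exponent vectors of a given degree and builds the explicit Volterra regressors for one input location; the function names and the ordering of the monomials are our own choices, not prescribed by the text.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def exponents(k, m):
    """All exponent vectors (b_1, ..., b_m) of nonnegative integers summing to k."""
    out = []
    for combo in combinations_with_replacement(range(m), k):
        b = [0] * m
        for i in combo:          # each chosen index contributes one unit of degree
            b[i] += 1
        out.append(tuple(b))
    return out

def volterra_features(x, p):
    """Monomials of degree 1..p in the components of x (the explicit regressors)."""
    feats = []
    for k in range(1, p + 1):
        for b in exponents(k, len(x)):
            feats.append(prod(xi ** bi for xi, bi in zip(x, b)))
    return feats

def d(p, m):
    """Total number of Volterra parameters d(p, m) = sum_k C(k, m)."""
    return sum(comb(k + m - 1, m - 1) for k in range(1, p + 1))

print(volterra_features([2, 3], 2))   # [2, 3, 4, 6, 9]
print(d(3, 6), d(3, 80))              # the constant term is not counted here
```

Since (8.22) starts at k = 1, d(p, m) is one less than the binomial coefficient used for the polynomial kernel, which also counts the constant monomial.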

So, this model has memory length m and polynomial order p. As $p \to \infty$ it
follows that f(x) in (8.22), with possibly the addition of a constant function, can
approximate any ("reasonable") function arbitrarily well. This universal approximation
property is of course very valuable for black box models and created considerable
interest in Volterra models. However, it is easy to see that the number d(p, m) of
parameters g increases very rapidly with m and p and that high-order polynomials
in the observed signals may create numerically ill-conditioned calculations. Hence,
Volterra models have not been used much in practical identification problems,
except for small values of m and p.

A remedy for the large number of parameters and ill-conditioned numerics is 
clearly to use regularization. In [4] it is discussed how to regularize the Volterra 
model to make it a practical tool. In short, the idea is the following, illustrated for a 
small example with p = 2. 

We write the model also adding a scalar $g_0$ which accounts for a constant
component in the output so that one has

$$y(t) = g_0 + g_1\varphi(t) + \varphi^T(t)\, G_2\, \varphi(t) \tag{8.24a}$$

$$\varphi^T(t) = [u(t-1),\ u(t-2),\ \dots,\ u(t-m)] \tag{8.24b}$$

$$g_1^T = \theta_1:\ m\text{-dimensional column vector} \tag{8.24c}$$

$$G_2:\ m \times m \text{ symmetric matrix}, \tag{8.24d}$$

where the matrix $G_2$ is formed from the elements $g_2^{(r)}$ of $g_2$ in the expansion
(8.22)-(8.23e). The regularized estimation can now be formed as the criterion

$$\hat{\theta}^R = \arg\min_{\theta} \|Y - \Phi_N^T\theta\|^2 + \theta^T D\,\theta \tag{8.25}$$

with

$$\theta = [g_0,\ \theta_1^T,\ \theta_2^T]^T \tag{8.26}$$

and $\theta_2$ is an m(m + 1)/2-dimensional column vector made up from $G_2$, and Y is
the vector of observed outputs y(t) with t = 1, ..., N. The regression matrix $\Phi_N$ is
formed from the components of $\varphi(t)$ in the obvious way. It is natural to decompose
the regularization matrix accordingly:

$$D = \begin{bmatrix} d_0 & 0 & 0 \\ 0 & D_1 & 0 \\ 0 & 0 & D_2 \end{bmatrix} \tag{8.27}$$

and treat the regularization of the constant term ($d_0$), the linear term ($D_1$) and the
quadratic term ($D_2$) in (8.24a) separately. As discussed in Chap. 5, a natural choice of
regularization matrices is to let them reflect prior information about the corresponding
parameters. That means that $d_0$ can be taken as any suitable scalar. The $\theta_1$
vector for the first-order term describes a regular linear impulse response, and the
prior for that one can be taken as, e.g., the DC kernel reported in (5.40), i.e.,

$$P_1(i,j) = c\, e^{-\alpha|i-j|}\, e^{-\beta(i+j)/2}. \tag{8.28}$$


For the second-order model $\theta_2$ it is natural to treat the second-order nonlinear
term in the Volterra expansion as a two-dimensional surface, described by two time
indices $\tau_1$ and $\tau_2$, so that the parameter at $(\tau_1, \tau_2)$ is the
contribution to the Volterra sum from $u(t-\tau_1) \cdot u(t-\tau_2)$. This is illustrated
in Fig. 8.12. The prior value of this contribution can be formed as the product of two
kernels built up from responses in a coordinate system U, V after an orthonormal
coordinate transformation, corresponding to a rotation of 45° of the original
$\tau_1, \tau_2$-plane:

$$P_2(i,j) = c_2\, P_V(i,j)\, P_U(i,j) \tag{8.29}$$

$$P_V(i,j) = e^{-\alpha_V \left| |V_i| - |V_j| \right|}\, e^{-\beta_V (|V_i| + |V_j|)/2} \tag{8.30}$$

$$P_U(i,j) = e^{-\alpha_U \left| |U_i| - |U_j| \right|}\, e^{-\beta_U (|U_i| + |U_j|)/2}, \tag{8.31}$$

where $U_i$ and $V_i$ refer to the coordinates in the new system. The corresponding prior
distribution is depicted in Fig. 8.12. As desired, it is smooth and decays to zero in all
directions. The coordinate change is useful to make the surface smooth over critical
border lines.

Fig. 8.12 Regularization surface for the second-order term in a Regularized Volterra
expansion (surface $h_2(\tau_1, \tau_2)$)

This regularization was deployed in [36], section "Example 5(a) Black-Box
Volterra Model of the Brain". Quite useful results were obtained with a model with 594
parameters, thanks to the regularization. An extension of the regularized Volterra
models, based on a similar idea, is treated in [41], which also provides an EM
algorithm to estimate the hyperparameters in the regularization matrices. Another
development, where the ideas developed in [4] are coupled with the implicit encoding
given by kernels, can be found in [8].
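As an illustration of the criterion (8.25) with the block structure (8.27), the sketch below estimates a second-order Volterra model by regularized least squares. Several choices are assumptions made for brevity rather than the method of [4]: the regressors are stored as rows (so the code uses the transpose of $\Phi_N$), $D_1$ is taken as the inverse of a DC-type kernel matrix in the exponential form of (8.28), the quadratic block is penalized with a simple scaled identity instead of the rotated kernel (8.29), and the hyperparameters are fixed rather than estimated.

```python
import numpy as np

def dc_kernel(m, c, alpha, beta):
    """DC-type kernel matrix, P[i, j] = c * exp(-alpha|i-j|) * exp(-beta(i+j)/2)."""
    i = np.arange(1, m + 1)
    return (c * np.exp(-alpha * np.abs(i[:, None] - i[None, :]))
              * np.exp(-beta * (i[:, None] + i[None, :]) / 2.0))

def volterra_regressors(u, m):
    """Rows [1, phi(t), upper-triangular entries of phi(t) phi(t)^T], t = m..len(u)-1."""
    iu = np.triu_indices(m)
    rows = []
    for t in range(m, len(u)):
        phi = u[t - m:t][::-1]                      # [u(t-1), ..., u(t-m)]
        rows.append(np.concatenate(([1.0], phi, np.outer(phi, phi)[iu])))
    return np.array(rows)

def regularized_volterra(u, y, m, d0, P1, d2):
    """Minimize ||Y - R theta||^2 + theta^T D theta with D = blockdiag(d0, D1, D2)."""
    R = volterra_regressors(u, m)
    n = R.shape[1]
    D = np.zeros((n, n))
    D[0, 0] = d0
    D[1:m + 1, 1:m + 1] = np.linalg.inv(P1)         # assumed: D1 = inverse of the prior kernel
    D[m + 1:, m + 1:] = d2 * np.eye(n - m - 1)      # simplified stand-in for (8.29)
    return np.linalg.solve(R.T @ R + D, R.T @ y[m:])
```

Here the entries of `theta` for cross-products $u(t-i)u(t-j)$, $i \neq j$, equal twice the corresponding off-diagonal entries of the symmetric matrix $G_2$, because each cross-product appears once in the regressor.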


8.5 Other Examples of Regularization in Nonlinear System 
Identification 


8.5.1 Neural Networks and Deep Learning Models 


There are many other universal approximators for nonlinear systems f(x) than
those based on kernels or on the explicit Volterra model (8.22). The most common
ones are various neural network models (NNMs), see, e.g., [12, 23]. They use simple
nonlinearities connected in more or less complex networks. The parameters are
weights in the connections as well as characterizations of the nonlinearities. Like
Volterra models they are capable of approximating any reasonable system arbitrarily
well given sufficiently many parameters. This means that an NNM typically has
many parameters. In simple applications there could be hundreds of parameters but
some applications, especially in the so-called deep model applications, could have
tens of thousands of parameters [18], see also [9, 11, 13, 43] for deep NARX and
state-space models. Even if benign overfitting has sometimes been observed also for
overparametrized models [3, 19, 30], in general regularization is a very important
tool also for estimating such models. Hence, many tricks are typically included in the
estimation/minimization schemes.


ℓ2, ℓ1 penalties They include the traditional weighted ℓ2 and ℓ1 norm penalties that
we discuss in this book, see, e.g., Sect. 3.6. For example, all estimation algorithms
in the System Identification Toolbox [22] are equipped with optional weighted
ℓ2-regularization, also when NNMs are estimated.


Early termination It is common to monitor not only the fit to estimation data in 
the minimization process, but also how well the current model fits a validation data 
set. Then the minimization is terminated when the fit to validation data no longer 
improves, even when the estimation criterion value keeps improving. This early 
termination technique is in fact equivalent to traditional regularization, as shown 
in [38]. 


Dropout or Dilution A special technique common in (deep) learning with NNM is 
to curb the flexibility of the model by ignoring (dropping) randomly chosen nodes in 
the network. This is of course a way to control that the model does not become prone 
to overfitting and provides regularization of the estimation just as the other methods 
in this book, but by a quite different technique. See, e.g., [17, 28] for more details. 


8.5.2 Static Nonlinearities and Gaussian Process (GP) 


A basic problem in nonlinear system identification is to handle estimation of a static
nonlinear function $h(\eta)$ from known observations

$$\{\zeta(t), \eta(t),\ t = 1, \dots, N\}, \qquad \zeta(t) = h(\eta(t)) + \text{noise}.$$

A general way to do this is to apply Gaussian Process (GP) estimation [29], see
also Sects. 4.9 and 8.2.1. Then $h(\eta)$ is seen as a Gaussian stochastic process with
a prior mean (often zero) and a certain prior covariance function $K(\eta_1, \eta_2)$. The
arguments can range over both a discrete and a continuous domain. After a number of
observations $z = \{\zeta(t), \eta(t),\ t = 1, \dots, N\}$, the posterior distribution of
the process, $h^p(\eta|z)$, can be determined for any $\eta$. This is, in short, how the
function h can be estimated. As seen in Sect. 8.2.1, it corresponds to a kernel method
with the kernel determined by the prior covariance function $K(\eta_1, \eta_2)$.
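For a zero prior mean and Gaussian measurement noise, the posterior mean and variance of h at new points have the standard closed form, sketched below with NumPy. The squared-exponential covariance, the noise variance and the function names are illustrative assumptions; any valid prior covariance K could be used in their place.

```python
import numpy as np

def se_kernel(a, b, scale=1.0, length=1.0):
    """Squared-exponential prior covariance K(eta1, eta2) for scalar arguments."""
    return scale * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length ** 2))

def gp_posterior(eta_train, zeta_train, eta_test, kernel, noise_var):
    """Posterior mean and pointwise variance of h at eta_test (zero prior mean)."""
    K = kernel(eta_train, eta_train) + noise_var * np.eye(len(eta_train))
    Ks = kernel(eta_test, eta_train)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, zeta_train))
    mean = Ks @ alpha                               # same form as a kernel estimate
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(kernel(eta_test, eta_test)) - np.sum(v ** 2, axis=0)
    return mean, var
```

The posterior mean coincides with the regularized kernel estimate whose regularization parameter equals the noise variance, while the posterior variance returns to the prior variance far from the observed values of η.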


8.5.3 Block-Oriented Models 


A very common family of nonlinear dynamic models is obtained by networks of
linear dynamic models G(q) and nonlinear static functions h(x), see Fig. 8.13. The
simplest and most common ones are the Hammerstein model y(t) = G(h(u(t))),
which is obtained by passing the input through a static nonlinearity before it enters the
linear system, and the Wiener model z = G(u), y(t) = h(z(t)), where the output of a
linear system subsequently passes through the nonlinearity. The important contribution
[5] has shown that any nonlinear system with fading memory can be approximated
by a Wiener model. See also, e.g., [37] for a survey and [42] for a general approach
to Hammerstein-Wiener identification allowing coloured noise sources both before
and after the nonlinearities (which may be non-invertible).

Fig. 8.13 Common block-oriented models. Green ovals: static nonlinearity h. Red blocks:
linear dynamic systems. From top to bottom: Wiener model, Hammerstein model,
Hammerstein-Wiener model
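The difference between the two basic structures is just the order in which the two blocks are applied, as the following simulation sketch shows. Representing the linear block by a finite impulse response `g` with a direct term `g[0]` and zero initial conditions is an assumption made here for simplicity.

```python
import numpy as np

def fir_filter(g, u):
    """Causal FIR response z(t) = sum_k g[k] u(t-k), with zero initial conditions."""
    return np.convolve(u, g)[:len(u)]

def hammerstein(g, h, u):
    """Static nonlinearity h first, then the linear system: y = G(h(u))."""
    return fir_filter(g, h(u))

def wiener(g, h, u):
    """Linear system first, then the static nonlinearity: y = h(G(u))."""
    return h(fir_filter(g, u))

g = np.array([1.0, 0.5, 0.25])
impulse = np.array([1.0, 0.0, 0.0, 0.0])
print(hammerstein(g, np.tanh, impulse))   # tanh(1) * [1, 0.5, 0.25, 0]
print(wiener(g, np.tanh, impulse))        # tanh([1, 0.5, 0.25, 0])
```

Even for the same pair (g, h) the two models generally produce different outputs, so the position of the nonlinearity has to be regarded as part of the model structure.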

Traditionally, the nonlinearities have been parametrized, e.g., as piecewise constant
or piecewise linear, as polynomials or as neural nets. Recently it has become more
common to work with nonparametric nonlinearities, which are typically modelled
by the GP approach, and the whole estimation is then treated in a Bayesian setting. For
example, in [21] the linear part of a Wiener model is parametrized by state-space
matrices A, B in an observer canonical form with suitable priors and the output
nonlinearity h(z) is a Gaussian Process with a prior mean equal to z ("linear output")
and a large and "smooth" prior covariance. To obtain the posterior densities, a particle
Gibbs sampler (PMCMC, Particle Markov Chain Monte Carlo) is employed.

In [32] the same approach is used to model the output nonlinearity, but the linear
part is written as an impulse response, with a prior of the same type as discussed in
Sect. 5.5.1. The whole problem can then be written as

$$y = h(\Phi g), \tag{8.32}$$

where y is the observed output, h is the output static nonlinearity, g is the impulse
response of the linear system and $\Phi$ is the Toeplitz matrix formed from the input. The
problem is then to determine the posterior densities p(h|y) and p(g|y) by Bayesian
calculations. In [31] a similar technique is used for estimating Hammerstein models.


8.5.4 Hybrid Models 


A common class of nonlinear models is that of hybrid models [15, 39]. They change
their properties depending on some regime variable p(t) (which may be formed
from the inputs and outputs themselves) [16]. Think of a collection of linear models
that describe the system behaviour in different parts of the operating space and
automatically shift as the operating point changes. Building a hybrid model involves
two steps: (1) find the collection of relevant different models and (2) determine the
areas where each model is operative. This is considered quite a difficult problem,
and approaches from different areas in control theory have been tested. Here we will
comment upon a few ideas that relate to regularized identification.

A basic problem is to decide when a change in system behaviour occurs. This
relates to change detection and signal segmentation. A regularization-based method
to segment ARX models was suggested in [25]. The standard way to estimate ARX
models can be described as in Chap. 2:

$$\min_{\theta} \sum_{t=1}^{N} \|y(t) - \varphi^T(t)\theta\|^2. \tag{8.33}$$

This gives the average linear model behaviour over the time record $t \in [1, \dots, N]$.
To follow momentary changes over time, we could estimate N models by

$$\min_{\theta(t),\ t=1,\dots,N}\ \sum_{t=1}^{N} \|y(t) - \varphi^T(t)\theta(t)\|^2. \tag{8.34}$$

This would give a perfect fit with a pretty useless collection of models. To express that
we need to be more selective when accepting new models, we can add an ℓ1 regularization
term, discussed in Sect. 3.6, obtaining:

$$\min_{\theta(t),\ t=1,\dots,N}\ \sum_{t=1}^{N} \|y(t) - \varphi^T(t)\theta(t)\|^2 + \gamma \sum_{t=2}^{N} \|\theta(t) - \theta(t-1)\|_1. \tag{8.35}$$

One could also use the norms in $\ell_p$ with p > 1 as regularizers but it is crucial that
the penalty is a sum of norms and not a sum of squared norms. Then, adopting a
suitable value for the regularization parameter γ, the penalty favours the terms in
the second sum to be exactly zero and not just small. This will force the number of
different models from (8.35) to be small and thus just flag when essential changes
have taken place.
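Why a norm produces exact zeros while a squared norm does not can be seen on a one-step surrogate of the problem: given a "raw" parameter change z, find the change d that trades closeness to z against the penalty. This is only a simplified illustration, not an algorithm for solving (8.35); the function names are ours and the Euclidean norm is used as the example of an $\ell_p$ norm with p > 1.

```python
import numpy as np

def prox_norm(z, gamma):
    """argmin_d 0.5*||d - z||^2 + gamma*||d||_2  (block soft-thresholding)."""
    nz = np.linalg.norm(z)
    if nz <= gamma:
        return np.zeros_like(z)        # small changes are set exactly to zero
    return (1.0 - gamma / nz) * z

def prox_squared_norm(z, gamma):
    """argmin_d 0.5*||d - z||^2 + gamma*||d||_2^2  (pure shrinkage)."""
    return z / (1.0 + 2.0 * gamma)     # never exactly zero unless z is zero

small_change = np.array([0.3, -0.2])
print(prox_norm(small_change, 0.5))          # exactly zero: no model change flagged
print(prox_squared_norm(small_change, 0.5))  # shrunk, but still a (tiny) change
```

With the squared norm every time instant would still carry a slightly different model, whereas the norm penalty keeps only the changes whose size exceeds the threshold set by γ.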

This idea is taken further in [24] to build hybrid models of PWA (piecewise affine)
character. The starting point is again (8.34), but now the number of models is reduced
by looking at all the raw models:

$$\min_{\theta(t),\ t=1,\dots,N}\ \sum_{t=1}^{N} \|y(t) - \varphi^T(t)\theta(t)\|^2 + \gamma \sum_{t=1}^{N}\sum_{s=1}^{N} K(p(t),p(s))\, \|\theta(t) - \theta(s)\|. \tag{8.36}$$

Here $K(p_1, p_2)$ is a weighting based on the respective regime variables p. This gives
a number of, say, d submodels, and they can then be associated with values of the
regime variable by a classification step.

These ideas of segmentation, building a collection of d submodels and associating
them with particular values of time are taken to a further degree of sophistication
in [27]. The idea there is to build a hybrid stable spline (HSS) algorithm, based on a
joint use of the TC (stable spline) kernel, see Sect. 5.5.1, for a family of ARX models
like (8.34). The classification of the models is built into the algorithm, by letting
the classification parameters be part of the hyperparameters. An MCMC scheme is
employed to handle the nonconvex and combinatorial difficulties of the maximum
likelihood criterion.


8.5.5 Sparsity and Variable Selection 


In all estimation problems it is essential to find the regressors $x_k(t)$, where k =
1, ..., d, which are best suited for predicting the goal variable y(t). The variables
$x_k$ can be formed from the observations from the system in many different ways.
It is generally desired to find a small collection of regressors, and statistics offers
many tools for this: hypothesis analysis, projection pursuit [14], manifold
learning/dimensionality reduction [10, 26, 34], ANOVA, see, e.g., [20] for applications
to nonlinear system identification.

The problem of variable (regressor) selection can be formulated as follows. Given
a model with n candidate regressors $x_k(t)$

$$y(t) = f(x_1(t), \dots, x_n(t)) + e(t) \tag{8.37}$$

find a subselection or combination of regressors $x_1(t), \dots, x_d(t)$ that gives the
best model of the system. Note that the NARX model (8.3) is a special case of (8.37)
with $x_k(t) = [y(t-k), u(t-k)]$. In principle one could try out different subsets of
regressors and see how good the resulting models are (in cross validation). That would
in most cases mean overwhelmingly many tests.

Instead, the ℓ1-norm regularization discussed in Sect. 3.6.1, leading to LASSO in
(3.105), is a very powerful tool for variable selection and sparsity. In what follows
each $x_i(t)$ is scalar and is the ith component of the n-dimensional vector x(t). Then,
for a linearly parametrized model

$$y(t) = \beta_1 x_1(t) + \dots + \beta_n x_n(t) + e(t), \tag{8.38}$$

the best regressors are found by the criterion

$$\min_{\beta} \sum_{t=1}^{N} |y(t) - \varphi(t)\beta|^2 + \gamma\|\beta\|_1 \tag{8.39}$$

$$\beta = [\beta_1, \beta_2, \dots, \beta_n]^T \tag{8.40}$$

$$\varphi(t) = [x_1(t), x_2(t), \dots, x_n(t)]. \tag{8.41}$$
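A criterion of the form (8.39) can be minimized with a few lines of code. The sketch below uses the proximal gradient method (ISTA), which is one of several possible solvers and is chosen here only for brevity; the function names, the iteration count and the synthetic data in the usage example are assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(Phi, y, gamma, n_iter=5000):
    """Minimize ||y - Phi b||^2 + gamma * ||b||_1 by proximal gradient (ISTA)."""
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2        # Lipschitz constant of the gradient
    b = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ b - y)       # gradient of the quadratic term
        b = soft_threshold(b - grad / L, gamma / L)
    return b

# Usage on synthetic data: only regressors 0 and 3 enter the true model.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 10))                 # row t is the regressor phi(t)
beta_true = np.zeros(10)
beta_true[[0, 3]] = [2.0, -1.5]
y = Phi @ beta_true + 0.1 * rng.normal(size=200)
b_hat = lasso_ista(Phi, y, gamma=20.0)
print(np.flatnonzero(b_hat), b_hat[[0, 3]])
```

The estimate is exactly zero for the irrelevant regressors, at the price of a small shrinkage of the retained coefficients.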
This idea to use ℓ1-norm regularization was extended to the general model (8.37)
in [2]. It is based on the idea to estimate the partial derivatives
$\beta_k = \partial f/\partial x_k$ in (8.37) analogously to (8.39). In particular, the
Taylor expansion of f(x(t)) around $x^0$ is

$$f(x(t)) = f(x^0) + (x(t) - x^0)^T \frac{\partial f}{\partial x} + O(\|x(t) - x^0\|^2). \tag{8.42}$$

The partial derivative is evaluated at $x^0$ and is a column vector of dimension n with
row k given by the derivative w.r.t. $x_k$. As anticipated, denote this by β. These
parameters can be estimated by least squares with

$$\min_{\alpha,\beta} \sum_{t=1}^{N} \|y(t) - \alpha - (x(t) - x^0)^T\beta\|^2\, K(x(t) - x^0) + \gamma\|\beta\|_1, \tag{8.43}$$

where α corresponds to $f(x^0)$, β is the vector of partial derivatives $\beta_k$ and K is
a kernel that focuses the sum on points x(t) in the vicinity of $x^0$. The ℓ1-norm
regularization term is added just as in (8.39) to promote zero estimates of the gradients.
This will focus on selecting regressors $x_k$ that are important for the model.

With the so-called iterative reweighting [6], the regularization term can be refined 
to 

$$\gamma \sum_{k=1}^{n} w_k |\beta_k|, \tag{8.44}$$

where the weights $w_k = 1/|\hat{\beta}_k|$ are based on the estimates $\hat{\beta}_k$ from (8.43). This refinement is 
suggested for inclusion in the algorithm of [2]. 
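The reweighting step in (8.44) can be sketched in a few lines of Python. The small constant eps, which guards against division by zero when an estimate is exactly zero, is an assumption added here and is not part of (8.44); the function names are hypothetical.

```python
def reweighted_l1_weights(beta_hat, eps=1e-6):
    """Weights w_k = 1 / (|beta_k| + eps) computed from a previous estimate.

    eps is an illustrative safeguard (not in the text) for estimates equal to zero.
    """
    return [1.0 / (abs(b) + eps) for b in beta_hat]


def weighted_l1_penalty(beta, weights, gamma):
    """Value of the refined regularization term gamma * sum_k w_k * |beta_k|."""
    return gamma * sum(w * abs(b) for w, b in zip(weights, beta))
```

Components that were estimated close to zero receive a large weight and are therefore penalized more heavily in the next pass, while large components are penalized less.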
Note that this test depends on the chosen point $x^0$. It would be a big task to investigate 
many such points. In [1] it is instead suggested to estimate expected values 
involving the partial derivatives $\partial f / \partial x_i$. This is done using the pdfs for $x_i$, given by $p_i(u)$, and their derivatives, which can 
be estimated by simple density estimation (involving only a scalar random variable). 
A comprehensive study of sparsity and regularization is made in [33]. It works 
with a more complex model definition, allowing $f : \mathbb{R}^n \to \mathbb{R}$ to be defined over 
several Hilbert spaces. The bottom line is still based on $\ell_1$-norm regularization of 
partial derivatives and the final learning algorithm is given by minimization of a 
functional 


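$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{t=1}^{N} (y(t) - f(x(t)))^2 + \gamma \left( 2 \sum_{a=1}^{n} \left\| \frac{\partial f}{\partial x_a} \right\|_N + \nu \|f\|_{\mathcal{H}}^2 \right). \tag{8.45}$$

Here, $\mathcal{H}$ can be a RKHS, the penalty on each partial derivative is given by 

$$\left\| \frac{\partial f}{\partial x_a} \right\|_N = \left( \frac{1}{N} \sum_{t=1}^{N} \left( \frac{\partial f(x(t))}{\partial x_a} \right)^2 \right)^{1/2},$$

$\gamma$ is the regularization parameter and $\nu$ is a small positive number that ensures stability 
and a strongly convex regularizer. 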
Chapter 9 
Numerical Experiments and Real World Cases 


Abstract This chapter collects some numerical experiments to test the performance 
of kernel-based approaches for discrete-time linear system identification. Using 
Monte Carlo simulations, we will compare the performance of kernel-based methods 
with the classical PEM approaches described in Chap. 2. Simulated and real data are 
included, concerning a robotic arm, a hairdryer and a problem of temperature predic- 
tion. We conclude the chapter by introducing the so-called multi-task learning where 
several functions (tasks) are simultaneously estimated. This problem is significant if 
the tasks are related to each other so that measurements taken on a function are infor- 
mative with respect to the other ones. A problem involving real pharmacokinetics 
data, related to the so-called population approaches, is then illustrated. Results will 
often be illustrated by using MATLAB boxplots. As already mentioned in Sect. 7.2, 
when commenting on Fig. 7.8, the median is given by the central mark while the box 
edges are the 25th and 75th percentiles. The whiskers extend to the most extreme 
fits not seen as outliers. Then, the outliers are plotted individually. 


9.1 Identification of Discrete-Time Output Error Models 


In this section, we will consider two numerical experiments with data generated 
according to the discrete-time output error (OE) model 


$$y(t) = G_0(q)u(t) + e(t),$$

where $G_0$ is a rational transfer function while e is white Gaussian noise independent 
of the known input u. Using simulated data, we will compare the performance of 
the classical PEM approach, as described in Chap. 2, with some of the regularized 
techniques illustrated in this book. In particular, we will adopt regularized high-order 
FIR, with impulse response coefficients contained in the m-dimensional (column) 
vector $\theta$ and the output data in the (column) vector $Y = [y(1) \ldots y(N)]^T$. So, letting 
the regression matrix $\Phi \in \mathbb{R}^{N \times m}$ be 

$$\Phi = \begin{bmatrix} u(0) & u(-1) & u(-2) & \ldots & u(-m+1) \\ u(1) & u(0) & u(-1) & \ldots & u(-m+2) \\ \vdots & \vdots & \vdots & & \vdots \\ u(N-1) & u(N-2) & u(N-3) & \ldots & u(N-m) \end{bmatrix},$$

our estimator is 

$$\hat{\theta} = \arg\min_{\theta} \|Y - \Phi\theta\|^2 + \gamma \theta^T P^{-1} \theta \tag{9.1a}$$

$$= P\Phi^T (\Phi P \Phi^T + \gamma I_N)^{-1} Y \tag{9.1b}$$

$$= (P\Phi^T\Phi + \gamma I_m)^{-1} P\Phi^T Y. \tag{9.1c}$$
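To make the estimator concrete, here is a minimal pure-Python sketch of (9.1c). It is not the toolbox implementation: the helper names, the dense Gaussian elimination and the toy data are assumptions chosen to keep the example self-contained, and the unknown past inputs u(t), t < 0, are set to zero.

```python
def regression_matrix(u, N, m):
    """Row t (t = 1..N) is [u(t-1), ..., u(t-m)], taking u(s) = 0 for s < 0.

    u is a list with u[s] = u(s) for s = 0, ..., N-1.
    """
    return [[u[t - k] if t - k >= 0 else 0.0 for k in range(1, m + 1)]
            for t in range(1, N + 1)]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def regularized_fir(Phi, Y, P, gamma):
    """theta = (P Phi^T Phi + gamma I_m)^{-1} P Phi^T Y, i.e. the form (9.1c)."""
    m = len(P)
    PPhiT = matmul(P, transpose(Phi))  # m x N
    A = matmul(PPhiT, Phi)             # m x m
    for i in range(m):
        A[i][i] += gamma
    rhs = [sum(PPhiT[i][t] * Y[t] for t in range(len(Y))) for i in range(m)]
    return solve(A, rhs)


# Toy usage (hypothetical numbers): noise-free data from a 3-tap FIR system.
g_true = [1.0, 0.5, 0.25]
u = [1.0, -1.0, 2.0, 0.5, -0.5, 1.5, -2.0, 1.0]
N, m = len(u), 3
Phi = regression_matrix(u, N, m)
Y = [sum(row[k] * g_true[k] for k in range(m)) for row in Phi]
# Illustrative regularization matrix with entries 0.5 ** max(k, j)
P = [[0.5 ** max(k, j) for j in range(1, m + 1)] for k in range(1, m + 1)]
theta_hat = regularized_fir(Phi, Y, P, gamma=1e-6)
```

The form (9.1c) only requires solving an m x m system and never inverts P, which is convenient when P is close to singular; the form (9.1b) solves an N x N system instead.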


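We have already seen in (5.40), (5.41) and (7.30), using MaxEnt arguments and 
spline theory, that choices for the regularization matrix P can be the first- or second-order 
stable spline kernel, denoted by TC and SS, respectively, or the 
DC kernel. They are recalled below, specifying also the hyperparameter vector $\eta$: 

$$\text{TC} \quad P_{kj}(\eta) = \lambda \alpha^{\max(k,j)}, \qquad \lambda \ge 0,\ 0 \le \alpha < 1,\ \eta = [\lambda, \alpha], \tag{9.2}$$

$$\text{SS} \quad P_{kj}(\eta) = \lambda \left( \frac{\alpha^{k+j+\max(k,j)}}{2} - \frac{\alpha^{3\max(k,j)}}{6} \right), \qquad \lambda \ge 0,\ 0 \le \alpha < 1,\ \eta = [\lambda, \alpha], \tag{9.3}$$

$$\text{DC} \quad P_{kj}(\eta) = \lambda \alpha^{(k+j)/2} \rho^{|k-j|}, \qquad \lambda \ge 0,\ 0 \le \alpha < 1,\ |\rho| \le 1,\ \eta = [\lambda, \alpha, \rho]. \tag{9.4}$$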
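The three kernels (9.2)-(9.4) translate directly into code. The sketch below builds the m x m regularization matrices as nested Python lists; the function names are chosen here for illustration, and indices k, j run from 1 to m as in the formulas.

```python
def tc_kernel(m, lam, alpha):
    """TC kernel (9.2): P[k][j] = lam * alpha ** max(k, j)."""
    return [[lam * alpha ** max(k, j) for j in range(1, m + 1)]
            for k in range(1, m + 1)]


def ss_kernel(m, lam, alpha):
    """SS kernel (9.3): second-order stable spline."""
    return [[lam * (alpha ** (k + j + max(k, j)) / 2.0
                    - alpha ** (3 * max(k, j)) / 6.0)
             for j in range(1, m + 1)]
            for k in range(1, m + 1)]


def dc_kernel(m, lam, alpha, rho):
    """DC kernel (9.4): P[k][j] = lam * alpha ** ((k + j) / 2) * rho ** |k - j|."""
    return [[lam * alpha ** ((k + j) / 2.0) * rho ** abs(k - j)
             for j in range(1, m + 1)]
            for k in range(1, m + 1)]
```

Since (k + j + |k - j|)/2 = max(k, j), choosing rho equal to the square root of alpha in the DC kernel gives back the TC kernel, which offers a quick sanity check of an implementation.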


9.1.1 Monte Carlo Studies with a Fixed Output Error Model 


In this example the true impulse response is fixed to that reported in Fig. 8.2, obtained 
by random generation of a rational transfer function of order 10. It has to be estimated 
from 500 input-output couples (collected with system initially at rest). The input is 
white noise filtered by the rational transfer function 1/(z — p) where p will vary 
over the unit interval during the experiment. Note that p establishes the difficulty 
of our system identification problem. Values close to zero make the input similar 
to white noise and the output data informative over a wide range of frequencies. 
Instead, values of p close to 1 increase the low-pass nature of the input and, hence, 
the ill-conditioning. The measurement noise is white and Gaussian with variance 
equal to that of the noiseless output divided by 50. Two estimators will be adopted: 

• Oe+Or. Classical PEM approach (2.22) equipped with an oracle. In particular, 
our candidate models are rational transfer functions where the order of the two 
polynomials is equal and can vary between 1 and 30. For any model order, estimation 
is performed through nonlinear least squares by solving (2.22) with $\ell$ in 
(2.21) set to the quadratic function. The method is implemented in oe.m of the 
MATLAB System Identification Toolbox. Then, the oracle chooses the estimate 
which maximizes the fit 

$$100 \left( 1 - \left[ \frac{\sum_{k=1}^{100} |g_k^0 - \hat{g}_k|^2}{\sum_{k=1}^{100} |g_k^0 - \bar{g}^0|^2} \right]^{1/2} \right), \qquad \bar{g}^0 = \frac{1}{100} \sum_{k=1}^{100} g_k^0, \tag{9.5}$$

where $g_k^0$ are the true impulse response coefficients while $\hat{g}_k$ denote their estimates. 
The estimator is given the information that system initial conditions are 
null. 

• TC+ML. This is the regularized estimator (9.1), equipped with the kernel TC. The 
number of estimated impulse response coefficients is m = 100 and the regression 
matrix is built with u(t) = 0 if t < 0. At every run, the noise variance is estimated 
by fitting via least squares a low-bias model for the impulse response. Then, the 
two kernel hyperparameters are obtained via marginal likelihood optimization, see 
(7.42). The method is implemented in impulseest.m of the MATLAB System 
Identification Toolbox. 
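The fit measure used by the oracle is easy to compute once true and estimated coefficients are available. The following Python sketch evaluates a fit of the form (9.5) for sequences of any common length; the function name is hypothetical.

```python
import math


def fit(true_vals, est_vals):
    """100 * (1 - sqrt(sum (g0 - g)^2 / sum (g0 - mean(g0))^2))."""
    mean_true = sum(true_vals) / len(true_vals)
    num = sum((g0 - g) ** 2 for g0, g in zip(true_vals, est_vals))
    den = sum((g0 - mean_true) ** 2 for g0 in true_vals)
    return 100.0 * (1.0 - math.sqrt(num / den))
```

A perfect estimate gives 100, an estimate equal to the mean of the true coefficients gives 0, and anything worse than that gives a negative value, so average fits below zero are possible for poorly performing estimators.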


We consider 4 Monte Carlo studies of 300 runs defined by different values of p in 
the set {0, 0.9, 0.95, 0.99}. As already mentioned, p = 0 corresponds to white noise 
input while p = 0.99 leads to a highly ill-conditioned problem (output data provide 
little information at high frequencies). Figure 9.1 reports the boxplots of the 300 
fits returned by Oe+Or and TC+ML for the four different values of p. Even though PEM 
exploits an oracle to tune complexity, its performance is (slightly) better than that of TC+ML 
only when the input is white noise, see also Table 9.1. When p increases, the ill-conditioning 
affecting the problem increases and TC+ML outperforms Oe+Or even 
though no oracle is used for hyperparameter tuning. This also points out the effectiveness 
of marginal likelihood optimization in controlling complexity. 

This case study shows that continuous tuning of hyperparameters may be a more 
versatile and powerful approach than classical estimation of discrete model orders. 
A problem related to PEM here could also be the presence of local minima of the 
objective. This is much less critical when adopting kernel-based regularization. In 
fact, TC+ML regulates complexity through only two hyperparameters while Oe+Or 
has to optimize many more parameters (their number is a function of the postulated model order). 

Fig. 9.1 Experiment with a fixed OE-model. Boxplots of 300 impulse response fits returned by PEM 
with an oracle to tune discrete model order and by TC with continuous hyperparameters estimated 
by marginal likelihood optimization. Results are a function of the level of ill-conditioning affecting 
the problem which increases with p (the input is white Gaussian noise for p = 0 while the other 
values define low-pass inputs) 


Table 9.1 Experiment with a fixed OE-model. Average fit, as a function of p, after 300 Monte 
Carlo runs. The value p = 0 corresponds to white noise input and the level of ill-conditioning then 
increases as p increases 

            Oe+Or   TC+ML 
p = 0       95.8    95.3 
p = 0.9     85.2    86.3 
p = 0.95    75.3    83.2 
p = 0.99    49.9    74.3 


9.1.2 Monte Carlo Studies with Different Output Error Models 


Now we consider two Monte Carlo studies of 1000 runs regarding identification of 
several discrete-time output error models. The outputs are still given by 

$$y(t) = G_0(q)u(t) + e(t)$$

with e white Gaussian noise independent of u, but the rational transfer function $G_0$ 
changes at every run. In fact, a 30th-order single-input single-output continuous-time 
system is first randomly generated by the MATLAB command rss.m. It is then 
sampled at 3 times its bandwidth and used if its poles fall within the circle of the 
complex plane with centre at the origin and radius 0.99. 

With the system at rest, 1000 input-output pairs are generated as follows. At any 
run, the system input is unit variance white Gaussian noise filtered by a second-order 
rational transfer function generated by the same procedure adopted to obtain $G_0$. 
The outputs are corrupted by additive white Gaussian noise with an SNR (the ratio 
between the variance of the noiseless output and that of the noise) randomly chosen in [1, 20] at 
any run. In the first experiment, the data set 

$$\mathcal{D}_T = \{u(1), y(1), \ldots, u(N), y(N)\}$$


contains the first 200 input-output couples, i.e., N = 200, while in the second exper- 
iment all the 1000 couples are used, i.e., N = 1000. 
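The noise level implied by a prescribed SNR follows from the definition above: the noise variance is the variance of the noiseless output divided by the SNR. A minimal Python sketch is given below; drawing the SNR uniformly in [1, 20] is an assumption (the text only says it is randomly chosen in that interval), and the function names and toy signal are hypothetical.

```python
import math
import random


def noise_std_for_snr(noiseless, snr):
    """Noise standard deviation such that var(noiseless) / var(noise) = snr."""
    mean = sum(noiseless) / len(noiseless)
    var = sum((z - mean) ** 2 for z in noiseless) / len(noiseless)
    return math.sqrt(var / snr)


def corrupt(noiseless, snr, rng):
    """Add white Gaussian noise at the requested SNR."""
    std = noise_std_for_snr(noiseless, snr)
    return [z + rng.gauss(0.0, std) for z in noiseless]


rng = random.Random(0)
snr = rng.uniform(1.0, 20.0)                 # assumed uniform draw in [1, 20]
z = [math.sin(0.1 * t) for t in range(1000)]  # stand-in for a noiseless output
y = corrupt(z, snr, rng)
y_first_experiment = y[:200]                 # N = 200 uses the first 200 samples
```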

Starting from null initial conditions, at any run we also generate two different 
kinds of test sets 

$$\mathcal{D}_{test} = \{u^{new}(1), y^{new}(1), \ldots, u^{new}(M), y^{new}(M)\}, \quad M = 1000.$$

The first test set is especially challenging since noiseless outputs are generated by 
using unit variance white Gaussian noise as input. In the second test set the input 
has instead the same statistics as the input entering the identification data, hence making 
prediction easier. 

The performance of a model characterized by $\hat{\theta}$, and returning $\hat{y}^{new}(t|\hat{\theta})$ as output 
prediction at instant t, is 

$$\mathcal{F}(\hat{\theta}) = 100 \left( 1 - \left[ \frac{\sum_{t=1}^{M} (y^{new}(t) - \hat{y}^{new}(t|\hat{\theta}))^2}{\sum_{t=1}^{M} (y^{new}(t) - \bar{y}^{new})^2} \right]^{1/2} \right), \quad M = 1000, \tag{9.6}$$

where $\bar{y}^{new}$ is the average output in $\mathcal{D}_{test}$ and the $\hat{y}^{new}(t|\hat{\theta})$ are computed assuming 
zero initial conditions (otherwise high-order models could have the advantage of 
calibrating the initial conditions to fit $\mathcal{D}_{test}$). The prediction fit (9.6) can be obtained 
by the MATLAB command predict(model,data,k,'ini','z') where 
model and data denote structures containing the estimated model and the test set 
$\mathcal{D}_{test}$, respectively. 

In what follows, we will also use estimators equipped with an oracle which evaluates 
the fit (9.6) for the test set of interest. Different rational models with orders 
between 1 and 30 are tried and the oracle selects the orders that give the best fit. We 
are now in a position to introduce the following 6 estimators: 


• Oe+Or1. Classical PEM approach (2.22), with quadratic $\ell$ in (2.21), equipped 
with an oracle which uses the first test set (white noise input). As said, candidate 
models are rational transfer functions whose order can vary between 1 and 30. For 
any order, the model is returned by the function oe.m of MATLAB's System 
Identification Toolbox [14]. 

• Oe+Or2. The same procedure described above except that the oracle maximizes 
the prediction fit using the second test set (test input with statistics equal to those 
of the training input). 

• Oe+CV. The classical approach now does not use any oracle: model order is 
estimated by cross validation, splitting the identification data $\mathcal{D}_T$ into two sets 
containing the first and the last N/2 data, respectively. The model order minimizing 
the sum of squared prediction errors (computed assuming zero initial conditions) is chosen. Finally, 
the system estimate is computed using all the data in $\mathcal{D}_T$ by solving (2.22) with 
quadratic loss. 

• {TC+ML,SS+ML,DC+ML}. These are three regularized FIR estimators of the 
form (9.1) with order 200 and kernels TC (9.2), SS (9.3) and DC (9.4). Marginal 
likelihood optimization (7.42) is used to determine the noise variance and the 
kernel hyperparameters (2 for SS and TC, 3 for DC). The regularized FIR models 
are estimated using the function impulseest.m in MATLAB's System 
Identification Toolbox [14]. 


9.1.2.1 Results 


The MATLAB boxplots in Fig. 9.2 contain the 1000 fit measures returned by the 
estimators during the first experiment with N = 200 (left panels) and the second 
experiment with N = 1000 (right panels). Table 9.2 reports the average fit values. 

In the top panels of Fig. 9.2 one can see the fits on the first test set. Recall that 
Oe+Or1 has access to such data to optimize the prediction capability. Interestingly, 
despite this advantage, the performance of all the three regularized approaches is 
close to that of the oracle while that of Oe+CV is not so satisfactory. This is also visible 
in the first two rows of Table 9.2. 

The bottom panels of Fig. 9.2 show results relative to the second test set which 
is used by Oe+Or2 to maximize the prediction fit. Since training and test data are 
more similar, the prediction capability of Oe+CV improves significantly but the 
regularized estimators still outperform the classical approach, see also the last two 
rows of Table 9.2. 


Table 9.2 Identification of discrete-time OE-models. Average fit, after 1000 Monte Carlo runs, as 
a function of the test set type and the identification data set size (N = 200 or N = 1000). Results 
in the first and last column come from oracle-based estimators which cannot be implemented in 
practice 

                          Oe+Or1   TC     SS     DC     Oe+CV   Oe+Or2 
1st test set, N = 200     52.7     51.9   48.8   51.1   34.8    -11.9 
1st test set, N = 1000    66.2     63.4   58.5   63.1   -20.9   28.2 
2nd test set, N = 200     84.8     86.3   85.9   86.8   72.9    87.8 
2nd test set, N = 1000    93.2     92.9   91.8   93.1   88.6    94.2 


Fig. 9.2 Identification of discrete-time OE-models. Top: Boxplots of the 1000 prediction fits on future 
outputs with test input given by white noise. The size of the identification data set is 200 (top left) or 
1000 (top right). Bottom: Differently from the results in the top panels, input statistics in the estimation 
and test data sets are the same. The first and last boxplots in each of the four panels contain results 
from the estimators Oe+Or1 and Oe+Or2 which cannot be implemented in practice 


Fig. 9.3 Robot arm. A portion of the input-output data for the robot arm, plotted against time (seconds): the input is the driving 
couple (bottom) and the output is the acceleration at the tip of the robot arm (top) 


9.1.3 Real Data: A Robot Arm 


Consider now the vibrating flexible robot arm described in [27], where two feedfor- 
ward controller design methods were compared on trajectory tracking problems. The 
input of the robot arm is the driving couple and the output is the acceleration at the tip 
of the robot arm. The input-output data contain 40960 data points. They are collected 
at a sampling frequency of 500 Hz for 10 periods with each period containing 4096 
data points. A portion of the data is shown in Fig. 9.3. The identification problem of 
the robot arm was studied in [23, Sect. 11.4.4] with frequency domain methods. 

We will build models by both the classical prediction error method and the kernel 
method with the DC kernel. Since the true system is unknown, to compare the 
performance of different impulse response estimates we divide the data into two 
parts: the training and the test set, given by the first 6000 input-output couples and 
the remaining ones, respectively. Then, we measure how well the models, built with 
the estimation data, predict the test outputs. 

For the prediction error method, we estimate nth-order state-space models without 
disturbance model and with zero initial conditions for n = 1,...,36. This 
method is available in MATLAB's System Identification Toolbox [14] as the command 
pem(data,n,'dist','no','ini','z'). The prediction fits computed 
using (9.6) are shown as a function of n in Fig. 9.4. An oracle 
that has access to the test set would select the order n = 18, hence obtaining 
a prediction fit equal to 79.75%. For the kernel method with the DC kernel, 
we estimate a high-order FIR model of order 3000 with hyperparameters tuned 
by optimizing the marginal likelihood. When forming the regression matrix, the 
unknown input data are set to zero. The prediction fit (9.6) is 83.07% and is 
shown as a horizontal dash-dot line in Fig. 9.4. The kernel method with the DC kernel 
is available in MATLAB's System Identification Toolbox [14] as the command 
impulseest(data,3000,0,opt) where, in the option opt, we set 
opt.RegulKernel='dc'; opt.Advanced.AROrder=0. 

Fig. 9.4 Robot arm. The solid red line is the fit for the prediction error method for different model 
orders n = 1,..., 36. The dash-dot blue line is the prediction fit on the test set for the regularized 
method with the DC kernel 

The Bode magnitude plot of the models estimated by PEM and the DC kernel 
is shown in Fig. 9.5. The empirical frequency function estimate obtained using the 
command etfe in MATLAB's System Identification Toolbox [14] is also displayed. 

The measured output and the predicted output over a portion of the test set are 
shown in Fig. 9.6. If one is concerned that a FIR model of order 3000 is too large, 
one could reduce such a high-order model by projecting it onto a low-order state-space 
model. Exploiting model order reduction techniques, the fit of a state-space 
model of order n = 25 is 79.8%, still better than the best state-space description that 
can be obtained by PEM. 

Fig. 9.5 Robot arm. Bode magnitude plot of the estimated models obtained by the empirical frequency 
function estimate ETFE (black), the regularized method with DC kernel and hyperparameters 
estimated by marginal likelihood optimization (blue), and the prediction error method with order 
n = 18 (red) 

Fig. 9.6 Robot arm. A portion of the test set (grey) and the predictions returned by the regularized 
method with DC kernel (blue) and the prediction error method (red) 


9.1.4 Real Data: A Hairdryer 


The second application is a real laboratory device, whose function is similar to that 
of a hairdryer: the air is fanned through a tube and then heated at a mesh of resistor 
wires, as described in [13, Sect. 17.3]. The input to the hairdryer is the voltage 
over the mesh of resistor wires while the output is the air temperature measured 
by a thermocouple. The input-output data contain 1000 data points collected at a 
sampling frequency of 12.5 Hz for 80 s. A portion of the data is shown in the top 
panel of Fig. 9.7. Since the input-output values move around 5 and 4.9, respectively, 
we detrend the measurements in such a way that they move around 0. The estimation 
and test set data are then given by the first and the last 500 input-output couples, 
respectively. 

As in the case of the robot arm, we build models by the classical prediction 
error method with an oracle, which maximizes the prediction fit, and the regularized 
approach with the DC kernel, with hyperparameters tuned by marginal likelihood 
optimization. For the prediction error method, we estimate nth-order state-space 
models without disturbance model for n = 1, ..., 36 and with zero initial conditions. 
The fits, as a function of n, are shown in Fig. 9.8. The best result is obtained for order 
n = 5 and is equal to 88.38%. For the kernel method with the DC kernel, we estimate a 
FIR model of order 70. When forming the regression matrix, we set the unknown 
input data to zero. The prediction fit (9.6) is equal to 88.15%, close to that achieved by 
PEM+Oracle. It is shown as a dash-dot blue line in Fig. 9.8. 
The test set and the predicted outputs returned by the two methods are shown in 
Fig. 9.9. One can see that the regularized approach has a prediction capability very 
close to that of PEM+Oracle. 


9.2 Identification of ARMAX Models 


In this section we consider the identification of linear systems 


$$y(t) = \left[ \sum_{i=1}^{p} G_{0i}(q) u_i(t) \right] + H_0(q) e(t). \tag{9.7}$$

Differently from the previous cases, beyond the presence of multiple observable 
inputs $u_i$, also the noise model is unknown. In fact, the e(t) are white Gaussian noise 
of unit variance filtered by a system $H_0(q)$ that has to be estimated from data. 

Fig. 9.7 Hairdryer. A portion of the input-output data for the hairdryer. The input is the voltage 
over the mesh of resistor wires (bottom panel) and the output is the air temperature measured by a 
thermocouple (top panel) 

Fig. 9.8 Hairdryer. The solid line is the fit for the prediction error method for different model orders 
n = 1,..., 36. The dash-dot line is the fit for the ReLS method with the DC kernel 

Fig. 9.9 Hairdryer. The measured output and the predicted output over the test data: the measured 
output (grey), the ReLS method with DC kernel (blue) and the prediction error method (red) 

First, it is useful to cast the identification of the general model (9.7) in a regularized 
context. Without loss of generality, to simplify the exposition, let p = 1 with the 
single observable input denoted by u. Exploiting (2.4), given the general linear model 
(9.7), we can write any predictor as two infinite impulse responses from y and u, 
respectively. When using ARX models, we have seen in (2.8) that such infinite 
responses specialize to finite responses. One has 

$$y(t) = -a_1 y(t-1) - \cdots - a_{n_a} y(t-n_a) + b_1 u(t-1) + \cdots + b_{n_b} u(t-n_b) + e(t) = \varphi_y^T(t)\theta_a + \varphi_u^T(t)\theta_b + e(t), \tag{9.8}$$

where $\theta_a = [a_1 \ldots a_{n_a}]^T$, $\theta_b = [b_1 \ldots b_{n_b}]^T$ and $\varphi_y(t)$, $\varphi_u(t)$ are made up from y 
and u in an obvious way. Thus, the ARX model is a linear regression model, to which 
the same ideas of regularization can be applied. This point is important since we have 
seen in Theorem 2.1 that ARX-expressions become arbitrarily good approximators 
for general linear systems as the orders $n_a$, $n_b$ tend to infinity. However, as discussed 
in Chap. 2, high-order ARX can suffer from large variance. A solution is to set 
$n_a = n_b = n$ to a large value and then introduce regularization matrices for the two 
impulse responses from y and from u. The P-matrix in (9.1) can be partitioned along 
with $\theta_a$, $\theta_b$: 

$$P(\eta_1, \eta_2) = \begin{bmatrix} P^a(\eta_1) & 0 \\ 0 & P^b(\eta_2) \end{bmatrix} \tag{9.9}$$


with $P^a(\eta_1)$, $P^b(\eta_2)$ defined, e.g., by any of (9.2)-(9.4). Letting $\theta = [\theta_a^T \ \theta_b^T]^T$ and 
building the regression matrix using $[\varphi_y^T(t) \ \varphi_u^T(t)]$ as rows, the estimator (9.1) now 
becomes a regularized high-order ARX. The MATLAB code for estimating this 
model using, e.g., the DC kernel would be 

    ao = arxRegulOptions('RegularizationKernel','DC'); 
    [Lambda,R] = arxRegul(data,[na nb nk],ao); 
    aropt = arxOptions; aropt.Regularization.Lambda = Lambda; 
    aropt.Regularization.R = R; 
    m = arx(data,[na nb nk],aropt); 
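Independently of the toolbox commands, the structure behind the regularized ARX can be sketched in Python: a block-diagonal regularization matrix with one block per predictor impulse response, and regression rows built from past outputs and inputs with the sign convention of (9.8). The function names are hypothetical, and the sketch assumes t >= n so that no samples before time 0 are needed.

```python
def block_diag(Pa, Pb):
    """Block-diagonal matrix [[Pa, 0], [0, Pb]] built from nested lists."""
    na, nb = len(Pa), len(Pb)
    top = [list(row) + [0.0] * nb for row in Pa]
    bottom = [[0.0] * na + list(row) for row in Pb]
    return top + bottom


def arx_regressor(y, u, t, n):
    """Row [-y(t-1), ..., -y(t-n), u(t-1), ..., u(t-n)], for t >= n.

    y and u are lists with y[s] = y(s) and u[s] = u(s).
    """
    return [-y[t - i] for i in range(1, n + 1)] + [u[t - i] for i in range(1, n + 1)]
```

With more than one input, the same construction simply adds one diagonal block and one group of regressor entries per additional input.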

We can also easily extend this construction to multiple inputs. Given any generic p, 
one needs to estimate p + 1 impulse responses with the matrix (9.9) now containing 
p + 1 blocks. If there are multiple outputs, one approach is to consider each output 
channel as a separate linear regression as in (9.8). The difference is that now also the 
other outputs need to be appended as done with the inputs. 


9.2.1 Monte Carlo Experiment 


One challenging Monte Carlo study of 1000 runs is now considered. Data are generated 
at any run by an ARMAX model of order 30 having p observable inputs, 
i.e., 

$$y(t) = \sum_{i=1}^{p} \frac{B_i(q)}{A(q)} u_i(t) + \frac{C(q)}{A(q)} e(t),$$

with p drawn from a random variable uniformly distributed on {2, 3, 4, 5}. Note that 
the system contains p + 1 rational transfer functions. They depend on the polynomials 
$A$, $B_i$ and $C$ which are randomly generated at any run by the MATLAB function 
drmodel.m. Such function is first called to obtain the common denominator $A$ 
and the first numerator $B_1$. The other p calls are used to obtain the numerators of 
the remaining rational transfer functions. The system so generated is used if the 
modulus of its poles is not larger than 0.95. In addition, letting $G_i(q) = B_i(q)/A(q)$ and 
$H(q) = C(q)/A(q)$, the signal to noise ratio has to satisfy 


$$1 \le \frac{\sum_{i=1}^{p} \|G_i\|_2^2}{\|H\|_2^2} \le 20,$$

where $\|G_i\|_2$, $\|H\|_2$ are the $\ell_2$ norms of the system impulse responses. 
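The ratio of squared norms entering the signal to noise condition can be approximated numerically by truncating the impulse responses. The Python sketch below assumes a monic denominator A(q) = 1 + a_1 q^{-1} + ... and numerator coefficients indexed from q^0; these conventions, the function names and the truncation length are assumptions made for illustration.

```python
def impulse_response(b, a, n_samples):
    """First n_samples impulse response coefficients of B(q)/A(q).

    b = [b_0, b_1, ...] and a = [1, a_1, ...] hold coefficients of powers of q^{-1};
    a[0] is assumed equal to 1.
    """
    g = []
    for k in range(n_samples):
        val = b[k] if k < len(b) else 0.0
        for i in range(1, min(k, len(a) - 1) + 1):
            val -= a[i] * g[k - i]
        g.append(val)
    return g


def squared_norm_ratio(G_list, H):
    """sum_i ||G_i||_2^2 / ||H||_2^2 computed from truncated impulse responses."""
    num = sum(sum(v * v for v in g) for g in G_list)
    den = sum(v * v for v in H)
    return num / den
```

For stable systems with poles of modulus at most 0.95, a few hundred coefficients are usually enough for the truncated sums to be close to the true norms.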

After a transient to mitigate the effect of initial conditions, at any run 300 input-output 
couples are collected to form the identification data set $\mathcal{D}_T$ and another 1000 
to define the test set $\mathcal{D}_{test}$. In any case, the input is white Gaussian noise of unit 
variance. 

Differently from the output error models, in the ARMAX case the performance 
measure adopted to compare different estimated models depends on the prediction 
horizon k. More specifically, let $\hat{y}_k^{new}(t|\hat{\theta})$ be the k-step-ahead predictor associated 
with an estimated model characterized by $\hat{\theta}$. For any t, such function predicts k-step-ahead 
the test output $y^{new}(t)$ by using the values of the test input $u^{new}$ up to time 
t - 1 and of the test output $y^{new}$ up to t - k. The prediction difficulty in general 
increases as k gets larger. The special case k = 1 corresponds to the one-step-ahead 
predictor given by (2.4); see, e.g., [13, Sect. 3.2] for the expressions of the 
generic k-step-ahead impulse responses. 

As done in (9.6), we use $\bar{y}^{new}$ to denote the mean of the outputs in $\mathcal{D}_{test}$, but now the 
prediction fit depends on k, being given by 

$$\mathcal{F}_k(\hat{\theta}) = 100 \left( 1 - \left[ \frac{\sum_{t=1}^{M} (y^{new}(t) - \hat{y}_k^{new}(t|\hat{\theta}))^2}{\sum_{t=1}^{M} (y^{new}(t) - \bar{y}^{new})^2} \right]^{1/2} \right), \quad M = 1000. \tag{9.10}$$

In this case, we say that an estimator is equipped with an oracle if it can use the 
test set to maximize $\sum_{k=1}^{20} \mathcal{F}_k$ by tuning the complexity of the model estimated using 
the identification data. The following estimators are then introduced: 


• PEM+Oracle: this is the classical PEM approach (2.22) with quadratic loss 
equipped with an oracle. The candidate model structures are ARMAX models 
with polynomials all having the same degree up to 30. For any model order, the 
command pem.m (or armax.m) of the MATLAB System Identification Toolbox [14] 
is used to obtain the system's estimate. 

• PEM+CV: in place of the oracle, model complexity is estimated by cross validation 
splitting $\mathcal{D}_T$ into two sets containing, respectively, the first and the last 150 input-output 
couples. The model order which minimizes the sum of the squared one-step-ahead 
prediction errors computed with zero initial conditions for the validation 
data is selected. The final system's estimate is returned by (2.22) using all the 
identification data. 

• {PEM+AICc,PEM+BIC}: this is the classical PEM approach with AIC-type criteria 
used to tune complexity, as reported in (2.35) and (2.36). 

• {TC+ML,SS+ML,DC+ML}: these are the three regularized least squares estimators 
introduced at the beginning of this section which determine the unknown 
coefficients of the multi-input version of the ARX model. After setting the length 
of each predictor impulse response to 50, the regularization matrices entering 
the multi-input version of (9.9) are defined by TC (9.2) or SS (9.3) or DC (9.4) 
kernels. The first 50 input-output pairs in $\mathcal{D}_T$ are used just as entries of the regression 
matrix. For every impulse response, a different scale factor $\lambda$ and a common 
variance decay rate $\alpha$ (and, in the case of DC, a correlation $\rho$) are adopted. The 
hyperparameters are determined via marginal likelihood optimization. 


All the system input delays are assumed known and their values are provided to all 
the estimators described above. 

The average of the fits $\mathcal{F}_k$ given by (9.10), as a function of the prediction horizon k, 
is reported in Fig. 9.10. Since PEM equipped with Akaike-like criteria returns very 
small average fits, results achieved by this kind of procedure are not displayed. 
The MATLAB boxplots of the 1000 values of $\mathcal{F}_1$ and $\mathcal{F}_{20}$ returned by all the 
estimators are visible in Fig. 9.11. The average fit of SS+ML is quite close to that 
of PEM+Oracle which is in turn outperformed by TC+ML and DC+ML. This is 
remarkable also considering that such kernel-based approaches can be used in real 
applications while PEM+Oracle relies on an ideal tuning which exploits the test set. 
Results returned by PEM equipped with CV are instead unsatisfactory. 

The results outline the importance of regularization, especially in experiments 
with relatively small data sets. In this case, only 300 input-output measurements 
are available with quite complex systems of order 30. The classical PEM approach 
equipped with any model order-selection rule cannot predict the test set better than 
the oracle. However, this latter can tune complexity by exploring only a finite set 
of given models. Kernel-based approaches can instead balance bias and variance 
by continuous tuning of regularization parameters. In this way, better performing 
trade-offs may be reached. 


9.2.2 Real Data: Temperature Prediction 


Now we consider thermodynamic modelling of buildings using some real data taken 
from [22]. Eight sensors are placed in two rooms of a small two-floor residential 
building of about 80 m² and 200 m³. They are located only on one floor (approximately 
40 m²). More specifically, temperatures are collected through a wireless 
sensor network made of 8 Tmote-Sky nodes produced by Moteiv Inc. The building 
was inhabited during the measurement period consisting of 8 days and samples were 
taken every 5 min. A thermostat controlled the heating system with the reference tem- 
perature manually set every day depending upon occupancy and other needs. This 
makes available a total of 8 temperature profiles displayed in Fig. 9.12. One can see 
the high level of collinearity of the signals. This makes the problem ill-conditioned, 
complicating the identification process. 

Fig. 9.11 Identification of ARMAX models. Boxplots of the 1000 values of $F_1$ (top panel, 
"1-step ahead fit") and $F_{20}$ (bottom panel, "20-step ahead fit") for the estimators 
PEM+Or, TC+ML, SS+ML, DC+ML, PEM+CV, PEM+BIC, PEM+AIC and PEM+AICc. Recall that 
PEM+Oracle uses additional information, having access to the test set to perform 
model order selection 

We consider only multiple-input single-output (MISO) models. The temperature 
from the first node is seen as the output ($y$) and the other 7 temperatures as inputs 
($u_j$, $j = 1, \ldots, 7$). Data are divided into 2 parts: those collected at time instants 
$1, \ldots, 1200$ form the identification set while those at instants $1201, \ldots, 2500$ are 
used for test purposes. With a 5 min sampling time, 1200 instants correspond 
to 100 h, a rather small time interval. Hence, we assume a “stationary” environment 
and normalize the data so as to have zero mean and unit variance before performing 
identification. Quality of the $k$-step-ahead prediction on test data is measured by 
(9.10). 
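
The data preparation just described can be sketched in a few lines of Python. This is only an illustration: the function name `split_and_normalize` is hypothetical, the array is a synthetic stand-in for the 8 temperature profiles, and computing the mean and standard deviation on the identification part only (then reusing them on the test part) is an assumption, since the text does not specify which samples are used for the normalization.

```python
import numpy as np

def split_and_normalize(data, n_ident=1200, n_total=2500):
    """Split a (samples x signals) array and standardize each column.

    Assumption: mean and standard deviation are computed on the
    identification part only and then reused on the test part.
    """
    data = np.asarray(data, dtype=float)
    ident, test = data[:n_ident], data[n_ident:n_total]
    mean = ident.mean(axis=0)
    std = ident.std(axis=0)
    return (ident - mean) / std, (test - mean) / std

# Synthetic stand-in for the 8 temperature profiles (not the real data).
rng = np.random.default_rng(0)
temperatures = 20.0 + 0.01 * rng.standard_normal((2500, 8)).cumsum(axis=0)

ident, test = split_and_normalize(temperatures)
y_ident, u_ident = ident[:, 0], ident[:, 1:]  # output: node 1, inputs: nodes 2..8
```

With this choice the identification data have exactly zero mean and unit variance, while the test data are only approximately standardized.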

Identification has been performed using ARMAX models with an oracle which 
has access to the test set. This estimator, called PEM+Or, maximizes $\sum_{k=1}^{48} F_k$, which 
accounts for the prediction capability up to 4 h ahead. The other estimator is 
regularized ARX equipped with the TC kernel, with a different scale factor $\lambda$ assigned 
to each unknown one-step-ahead predictor impulse response and a common decay 
rate $\alpha$. The length of each impulse response is set to 50 and the hyperparameters 
are estimated via marginal likelihood maximization using only the identification 
data. This estimator is denoted by TC+ML. Results are reported in Fig. 9.13 (top 
panel): the performance of PEM+Or and TC+ML is quite similar. Sample 
trajectories of one-hour-ahead test data prediction returned by TC+ML are also reported in 
Fig. 9.13 (bottom panel). 
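
To make the structure of the TC+ML estimator more concrete, the following Python sketch builds a block-diagonal TC prior with one scale factor per impulse response and a common decay rate, and computes the corresponding regularized estimate together with the marginal likelihood objective. It is a minimal sketch under stated assumptions: the function names are hypothetical, the regression matrix `Phi` (stacking past inputs and outputs) and the noise variance `sigma2` are assumed to be given, and the optimization over the hyperparameters is left to any general-purpose optimizer.

```python
import numpy as np

def tc_kernel(n, lam, alpha):
    """TC kernel matrix with entries lam * alpha**max(i, j), i, j = 1..n."""
    idx = np.arange(1, n + 1)
    return lam * alpha ** np.maximum.outer(idx, idx)

def block_tc_prior(n, lams, alpha):
    """Block-diagonal prior: one scale factor per impulse response, common decay."""
    P = np.zeros((n * len(lams), n * len(lams)))
    for b, lam in enumerate(lams):
        P[b * n:(b + 1) * n, b * n:(b + 1) * n] = tc_kernel(n, lam, alpha)
    return P

def regularized_estimate(Phi, y, P, sigma2):
    """theta = P Phi^T (Phi P Phi^T + sigma2 I)^{-1} y."""
    S = Phi @ P @ Phi.T + sigma2 * np.eye(len(y))
    return P @ Phi.T @ np.linalg.solve(S, y)

def neg_log_marginal_likelihood(Phi, y, P, sigma2):
    """Negative log marginal likelihood of y, up to an additive constant."""
    S = Phi @ P @ Phi.T + sigma2 * np.eye(len(y))
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (y @ np.linalg.solve(S, y) + logdet)
```

In the setting of this section one would call `block_tc_prior(50, lams, alpha)` with one entry of `lams` per predictor impulse response, and minimize `neg_log_marginal_likelihood` with respect to `lams`, `alpha` (and possibly `sigma2`) on the identification data only.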


9.3 Multi-task Learning and Population Approaches 


In the previous chapters we have studied the problem of reconstructing a real-valued 
function from discrete and noisy samples. An extension is the so-called multi-task 
learning problem in which several functions (tasks) are simultaneously estimated. 
This problem is significant if the tasks are related to each other so that measurements 
taken on a function are informative with respect to the other ones. An example is 
given by a network of linear systems whose impulse responses share some common 
features. Here, a relevant problem is the study of anomaly detection in homogeneous 
populations of dynamic systems [5, 6, 10]. Normally, all of them are supposed to 
have the same (possibly unknown) nominal dynamics. However, there can be a subset 
of systems that have anomalies (deviations from the mean) and the goal is to detect 
them from the data collected in the population. Important applications of 
multi-task learning arise also in biomedicine when multiple experiments are performed in 
subjects from a population [9]. Similar patterns are observed in individual responses 
so that measurements collected in a subject can also help reconstruct the responses 
of other individuals. In pharmacokinetics (PK) and pharmacodynamics (PD) the 
joint analysis of several individual curves is often exploited and called population 
analysis [24]. One class of adopted models is parametric, e.g., compartmental ones 
[7]. The problem can be solved using, e.g., the NONMEM software, which traces 
back to the seventies [3, 25], or more sophisticated approaches like Bayesian MCMC 
algorithms [15, 28]. More recently, machine learning/nonparametric approaches have 
been proposed for the population analysis of PK/PD data [16, 19, 20]. 

In the machine learning literature, the term multi-task learning was originally 
introduced in [4]. The performance improvement achievable by using a multi-task 
approach instead of a single-task one, which learns the functions separately, was 
then pointed out in [1, 26]; see also [2] for a Bayesian treatment. Next, a regularized 
kernel method hinging on the theory of vector-valued reproducing kernel Hilbert 
spaces [18] was proposed in [8]. Developments and applications of 
multi-task learning can then be found, e.g., in [11, 12, 17, 21, 29, 30]. 


9.3.1 Kernel-Based Multi-task Learning 


We will now see that multi-task learning can be cast within the RKHS setting 
developed in the previous chapters by defining a particular kernel. Just to simplify 
exposition, let us assume that there is a common input space $X$ for all the tasks and consider 
a set of $k$ functions $f_i : X \to \mathbb{R}$. Assume also that the following $n_i$ input-output data 
are available for each task $i$ 

$$(x_{1i}, y_{1i}),\ (x_{2i}, y_{2i}),\ \ldots,\ (x_{n_i i}, y_{n_i i}). \tag{9.11}$$

Our goal is to jointly estimate all the unknown functions $f_i$ starting from these 
examples. For this aim, first a kernel can be introduced to include our knowledge on the 
single functions (like smoothness) and also on their relationships. This can be done 
by defining an enlarged input space 

$$\mathcal{X} = X \times \{1, 2, \ldots, k\}.$$

Hence, a generic element of $\mathcal{X}$ is the couple $(x, i)$ where $x \in X$ while $i \in \{1, \ldots, k\}$. 
The index $i$ thus specifies that the input location belongs to the part of the function 
domain connected with the $i$th function. The information regarding all the tasks 
can now be specified by the kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which induces a RKHS of 
functions $f : \mathcal{X} \to \mathbb{R}$. In fact, we are just exploiting RKHS theory on function 
domains that include both continuous and discrete components. Note that, in practice, 
any function $f$ embeds $k$ functions $f_i$. 

Regularization in RKHS then allows us to reconstruct the tasks from the data 
(9.11) by computing 

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{k} \sum_{l=1}^{n_i} \mathcal{V}_{li}\big(y_{li}, f_i(x_{li})\big) + \gamma \|f\|_{\mathcal{H}}^2. \tag{9.12}$$

Under general conditions on the losses $\mathcal{V}_{li}$, we can then apply the representer theorem, 
i.e., Theorem 6.15, to obtain the following expression for the minimizer: 

$$\hat{f}_j(x) = \sum_{i=1}^{k} \sum_{l=1}^{n_i} c_{li}\, K\big((x, j), (x_{li}, i)\big), \qquad x \in X, \quad j = 1, \ldots, k \tag{9.13}$$

where $\{c_{li}\}$ are suitable scalars. Adopting quadratic losses which include weights 
$\{\sigma_{li}^2\}$, i.e., 

$$\mathcal{V}_{li}(a, b) = \frac{(a - b)^2}{\sigma_{li}^2}$$

for any $a, b \in \mathbb{R}$, a regularization network is obtained and the expansion coefficients 
$\{c_{li}\}$ solve the following linear system of equations 

$$\sum_{i=1}^{k} \sum_{l=1}^{n_i} \Big[ K\big((x_{li}, i), (x_{jq}, q)\big) + \gamma \sigma_{li}^2 \delta_{lj} \delta_{iq} \Big] c_{li} = y_{jq}, \tag{9.14}$$

where $q = 1, \ldots, k$, $j = 1, \ldots, n_q$ and $\delta_{ij}$ is the Kronecker delta. 
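
As an illustration, the linear system (9.14) and the expansion (9.13) can be sketched in Python by stacking all the examples of all the tasks in a single list of enlarged inputs (location, task index). The function names and the kernel `example_kernel` (a Gaussian kernel in $x$ with extra covariance inside each task) are hypothetical choices made only for this sketch, and the data are synthetic.

```python
import numpy as np

def gram_matrix(kernel, x, task):
    """Kernel matrix over the enlarged inputs (x[a], task[a])."""
    n = len(x)
    return np.array([[kernel((x[a], task[a]), (x[b], task[b]))
                      for b in range(n)] for a in range(n)])

def fit_multitask(kernel, x, task, y, sigma2, gamma=1.0):
    """Solve (K + gamma * diag(sigma2)) c = y, i.e., the linear system (9.14)."""
    K = gram_matrix(kernel, x, task)
    return np.linalg.solve(K + gamma * np.diag(sigma2), np.asarray(y, dtype=float))

def predict(kernel, x, task, c, x_new, j):
    """Evaluate the expansion (9.13) for task j at the input x_new."""
    return sum(c[a] * kernel((x_new, j), (x[a], task[a])) for a in range(len(c)))

def example_kernel(u, v):
    """Hypothetical kernel: Gaussian in x, with extra covariance inside a task."""
    (x1, p), (x2, q) = u, v
    return np.exp(-(x1 - x2) ** 2) * (1.0 + (p == q))

# Two tasks with two synthetic samples each.
x = [0.0, 1.0, 2.0, 0.5]
task = [1, 1, 2, 2]
y = [1.0, 2.0, 0.5, -1.0]
c = fit_multitask(example_kernel, x, task, y, sigma2=[1e-8] * 4)
```

With almost noiseless data, as in this toy call, the estimate of each task nearly interpolates its own samples, while the samples of the other task still influence the estimate away from them through the cross-task kernel values.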


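Connection with Bayesian estimation Exploiting the same arguments developed in 
Sect. 8.2.1, the following relationship between (9.13), (9.14) and Bayesian estimation 
of Gaussian random fields is obtained. Let the measurements model be 

$$y_{ji} = f_i(x_{ji}) + e_{ji} \tag{9.15}$$

where $\{e_{ji}\}$ are independent Gaussian noises of variances $\{\sigma_{ji}^2\}$. Define 

$$y_i = [y_{1i} \ \ldots \ y_{n_i i}]^T, \qquad y^k = [y_1^T \ \ldots \ y_k^T]^T.$$

Assume also that $\{f_i\}$ are zero-mean Gaussian random fields, independent of the 
noises, with covariances 

$$\mathrm{Cov}\big(f_i(x), f_q(s)\big) = K\big((x, i), (s, q)\big), \qquad x, s \in X,$$

where $i = 1, \ldots, k$ and $q = 1, \ldots, k$. Then, one obtains that for $j = 1, \ldots, k$, the 
minimum variance estimate of $f_j$ conditional on $y^k$ is defined by (9.13), (9.14) by 
setting $\gamma = 1$. Furthermore, the posterior variance of $f_j(x)$ is 

$$\mathrm{Var}\big[f_j(x) \,|\, y^k\big] = \mathrm{Var}\big[f_j(x)\big] - \mathrm{Cov}\big(f_j(x), y^k\big) \big(\mathrm{Var}[y^k]\big)^{-1} \mathrm{Cov}\big(f_j(x), y^k\big)^T. \tag{9.16}$$

In the above formula, in view of the independence assumptions, one has 

$$\mathrm{Var}[y^k] = \begin{bmatrix} V_{11} + \Sigma_1 & V_{12} & \cdots & V_{1k} \\ V_{21} & V_{22} + \Sigma_2 & \cdots & V_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ V_{k1} & V_{k2} & \cdots & V_{kk} + \Sigma_k \end{bmatrix}$$

where each block $V_{iq}$ belongs to $\mathbb{R}^{n_i \times n_q}$ and its $(l, j)$-entry is given by 

$$V_{iq}(l, j) = K\big((x_{li}, i), (x_{jq}, q)\big),$$

while $\Sigma_i = \mathrm{diag}\{\sigma_{1i}^2, \ldots, \sigma_{n_i i}^2\}$. In addition 

$$\begin{aligned} \mathrm{Cov}\big(f_j(x), y^k\big) &= \mathrm{Cov}\big(f_j(x), [f_1(x_{11}) \ \ldots \ f_1(x_{n_1 1}) \ \ldots \ f_k(x_{1k}) \ \ldots \ f_k(x_{n_k k})]\big) \\ &= \big[K((x, j), (x_{11}, 1)) \ \ldots \ K((x, j), (x_{n_1 1}, 1)) \\ &\qquad \ldots \ K((x, j), (x_{1k}, k)) \ \ldots \ K((x, j), (x_{n_k k}, k))\big]. \end{aligned}$$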
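
As a small worked instance of the posterior variance formula (9.16), consider the simplest case of one task ($k = 1$) and a single measurement $y_{11}$ collected at $x_{11}$ with noise variance $\sigma_{11}^2$. Writing $K(x, s)$ in place of $K((x, 1), (s, 1))$ just to lighten notation, all the matrices reduce to scalars and one obtains

$$\mathrm{Var}\big[f_1(x) \,|\, y_{11}\big] = K(x, x) - \frac{K(x, x_{11})^2}{K(x_{11}, x_{11}) + \sigma_{11}^2}.$$

In particular, at the sampled location $x = x_{11}$ the posterior variance becomes $K(x_{11}, x_{11})\, \sigma_{11}^2 / \big(K(x_{11}, x_{11}) + \sigma_{11}^2\big)$, which is smaller than both the prior variance and the noise variance and tends to zero as $\sigma_{11}^2 \to 0$.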


Example of multi-task kernel: average plus shift A simple yet useful class of 
multi-task kernels is obtained by defining $K$ as follows: 

$$K\big((x_1, p), (x_2, q)\big) = \bar{\lambda}^2 \bar{K}(x_1, x_2) + \delta_{pq} \tilde{\lambda}^2 \tilde{K}_p(x_1, x_2) \tag{9.17}$$

where $\bar{\lambda}^2$ and $\tilde{\lambda}^2$ are two scale factors that typically need to be estimated from data. 
Such kernel describes each function as the sum of an average function $\bar{f}$, hereafter 
named average task, and an individual shift $\tilde{f}_j$ specific for each task. Indeed, if 
$\bar{\lambda}^2 = 0$ all the functions would be learnt independently of each other. Instead, when 
$\tilde{\lambda}^2 = 0$ all the tasks are actually the same. The Bayesian interpretation of multi-task 
learning discussed above facilitates also the understanding of this model. In fact, 
once the kernel is seen as a covariance, it is easy to see that, for any $i$ and $x \in X$, 
each task decomposes into 

$$f_i(x) = \bar{f}(x) + \tilde{f}_i(x)$$

where $\bar{f}$ and $\{\tilde{f}_i\}$ are zero-mean independent Gaussian random fields. 
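
The two limiting cases of the average plus shift kernel can be checked with a short Python sketch. The names below are hypothetical, a Gaussian kernel is used only as a placeholder for the average and shift kernels, and, as a simplifying assumption, the same shift kernel is used for every task.

```python
import math

def average_plus_shift(base_avg, base_shift, lam_avg2, lam_shift2):
    """Return a kernel on enlarged inputs (x, task) with the structure of (9.17).

    Assumption: the same shift kernel base_shift is shared by all the tasks.
    """
    def kernel(u, v):
        (x1, p), (x2, q) = u, v
        value = lam_avg2 * base_avg(x1, x2)
        if p == q:  # Kronecker delta: the shift term acts only within a task
            value += lam_shift2 * base_shift(x1, x2)
        return value
    return kernel

def gaussian(x1, x2):
    """Placeholder kernel on the common input space."""
    return math.exp(-(x1 - x2) ** 2)

# No average task: different tasks are uncorrelated, hence learnt independently.
independent = average_plus_shift(gaussian, gaussian, 0.0, 1.0)
# No individual shifts: the covariance ignores the task index, tasks coincide.
identical = average_plus_shift(gaussian, gaussian, 1.0, 0.0)
```

For instance, `independent((0.3, 1), (0.7, 2))` is zero, while `identical((0.3, 1), (0.7, 2))` equals `identical((0.3, 1), (0.7, 1))`.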


9.3.2 Numerical Example: Real Pharmacokinetic Data 


Multi-task learning is now illustrated by considering a data set connected with xeno- 
biotics administration in 27 human subjects [20]. Such administration can be seen 
as the input to a continuous-time linear dynamic system whose (measurable) out- 
put is the drug profile in plasma. In any subject, 8 measurements were collected 
at 0.5, 1, 1.5, 2, 4, 8, 12, 24 h after a bolus, an input which can be seen as a Dirac 
delta. Hence, one has to deal with a particular continuous-time system identification 
problem where noisy and direct samples of the impulse response are available. 

Fig. 9.14 Multi-task learning. Xenobiotics concentration data after a bolus in 27 human subjects: 
average curve (thick) and individual curves 


In this experiment, noises are known to be Gaussian and heteroscedastic, i.e., 
their variances are not constant, being given by $\sigma_{ij}^2 = (0.1\, y_{ij})^2$. The 27 experimental 
concentration profiles are displayed in Fig. 9.14, together with the average profile. 
In light of the number of subjects, such average curve is a reasonable estimate of the 
average task $\bar{f}$. 

The whole data set consists of 216 pairs $(x_{ij}, y_{ij})$, for $i = 1, \ldots, 8$ and 
$j = 1, \ldots, 27$, and is split into an identification (training) and a test set. As for 
training, a sparse sampling schedule is considered: only 3 measurements per subject 
are randomly chosen among the 8 available data. We will adopt the multi-task 
estimator (9.12) to reconstruct all the continuous-time profiles. In view of the Gaussian 
and heteroscedastic nature of the noise, the losses are defined by 

$$\mathcal{V}_{ij}(a, b) = \frac{(a - b)^2}{\sigma_{ij}^2}.$$

As for the function model, since humans are expected to give similar 
responses to the drug, quite close to an average function, the kernel (9.17) is adopted. 
In addition, it is known that in these experiments there is a greater variability for small 
values of $t$, followed by an asymptotic decay to zero. This motivates the use of a 
stable kernel to model both the average and the shifts. A model suggested in [20] is 
a cubic spline kernel under the time-transformation 

$$h(t) = \frac{3}{t + 3},$$

which defines (9.17) through the correspondences 

$$\bar{K}(t, \tau) = \tilde{K}_p(t, \tau) = \frac{h(t) h(\tau) \min\{h(t), h(\tau)\}}{2} - \frac{\big(\min\{h(t), h(\tau)\}\big)^3}{6}.$$

Fig. 9.15 Multi-task learning. Single-task (left) and multi-task (right) estimates of some curves 
(thick line) with 95% confidence intervals (dashed lines) using only three data (circles) for each of 
the 27 subjects. The other five “unobserved” data (asterisks) are also plotted. Dotted line indicates 
the estimates obtained by using the full sampling grid 


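One can check that this model induces a stable RKHS by using Corollary 7.2. In fact, 
the kernels are nonnegative-valued and the integral of a generic kernel section is 

$$\int_0^{\infty} \left( \frac{h(t) h(\tau) \min\{h(t), h(\tau)\}}{2} - \frac{\big(\min\{h(t), h(\tau)\}\big)^3}{6} \right) d\tau = \frac{1}{2 (t + 3)^3} \left[ (27 t + 81) \log\left( \frac{t + 3}{3} \right) + 13.5\, t + 67.5 \right]$$

and this result clearly implies 

$$\int_0^{\infty} \int_0^{\infty} \left( \frac{h(t) h(\tau) \min\{h(t), h(\tau)\}}{2} - \frac{\big(\min\{h(t), h(\tau)\}\big)^3}{6} \right) d\tau\, dt < \infty.$$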
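
The closed-form expression of the kernel section integral can be checked numerically. The Python sketch below takes $h(t) = 3/(t+3)$, which is treated here as an assumption (it is the time transformation that reproduces the closed form above), and all the function names are hypothetical. The change of variable $b = h(\tau)$ maps the infinite integration interval into $(0, 1]$, where a plain midpoint rule is sufficient.

```python
import math

def h(t):
    """Assumed time transformation, consistent with the closed form in the text."""
    return 3.0 / (t + 3.0)

def cubic_spline(a, b):
    """Cubic spline kernel a*b*min(a,b)/2 - min(a,b)^3/6."""
    m = min(a, b)
    return a * b * m / 2.0 - m ** 3 / 6.0

def section_integral_numeric(t, n=20000):
    """Midpoint rule after the substitution b = h(tau), d tau = 3 / b^2 db."""
    a = h(t)
    step = 1.0 / n
    total = 0.0
    for i in range(n):
        b = (i + 0.5) * step
        total += cubic_spline(a, b) * 3.0 / (b * b)
    return total * step

def section_integral_closed_form(t):
    """Closed-form value of the kernel section integral."""
    bracket = (27.0 * t + 81.0) * math.log((t + 3.0) / 3.0) + 13.5 * t + 67.5
    return bracket / (2.0 * (t + 3.0) ** 3)
```

For large $t$ the closed form behaves like $\log(t)/t^2$ up to a constant, which is integrable on $[0, \infty)$ and explains why the double integral is finite.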


The initial plasma concentration is known to be zero. Hence, a zero variance virtual 
measurement in $t = 0$ was added for all tasks. The hyperparameters $\bar{\lambda}^2$ and $\tilde{\lambda}^2$ were 
then estimated via marginal likelihood maximization by exploiting the Bayesian 
interpretation of multi-task learning discussed above. 

Fig. 9.16 Multi-task learning. Boxplots of the prediction errors (RMSE) obtained by the single-task 
approach and by the multi-task approach 

The left and right panels of Fig. 9.15 report results obtained by the single- and 
the multi-task approach, respectively, in 5 subjects. One can see the data and the 
estimated curves with their 95% confidence intervals obtained using the posterior 
variance (9.16). Each panel shows also the estimates obtained by employing the full 
sampling grid. It is apparent that the multi-task estimates are closer to these reference 
profiles. A good predictive capability with respect to the other five “unobserved” data 
is also visible. To better quantify this aspect, let $I_j^f$ and $I_j^r$ denote the full and reduced 
sampling grid in the $j$th subject. Let also $I_j = I_j^f \setminus I_j^r$, whose cardinality is 5. Then, 
for each subject, we define the prediction error as 

$$\mathrm{RMSE}_j^{MT} = \sqrt{\frac{\sum_{i \in I_j} \big(y_{ij} - \hat{f}_j(x_{ij})\big)^2}{5}}$$

where $\hat{f}_j$ is the multi-task estimate of the $j$th profile, with the single-task 
$\mathrm{RMSE}_j^{ST}$ defined in a similar way. Figure 9.16 then reports 
the boxplots with the 27 RMSE returned by the single- and multi-task estimates. 
The improvement in the prediction performance due to the kernel-based population 
approach is evident. 
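
The prediction error of one subject can be computed as in the following Python sketch, where the function name is hypothetical and the numbers are made up: the error is evaluated only on the samples that were not used for training.

```python
import math

def held_out_rmse(y, y_hat, train_idx):
    """RMSE over the samples of one subject that were not used for training."""
    used = set(train_idx)
    held_out = [i for i in range(len(y)) if i not in used]
    squared_error = sum((y[i] - y_hat[i]) ** 2 for i in held_out)
    return math.sqrt(squared_error / len(held_out))

# Made-up data: 8 samples, 3 of them (indices 0, 3, 6) used for training.
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_pred = [v + 2.0 for v in y_obs]
error = held_out_rmse(y_obs, y_pred, train_idx=[0, 3, 6])
```

Applying the same function to the single-task and to the multi-task predictions of each subject gives the two sets of 27 values compared in the boxplots.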